My most memorable bug

It will be ten years this month since my team and I briefly thought we would be involved in an international diplomatic incident.

It was May 2010, and I had recently moved from the Google office in Dublin, Ireland, to the Mountain View, California, headquarters to join the Gadget Server team.

If you remember iGoogle, you surely also remember Gadgets. iGoogle was Google’s customizable web portal. It had small windows called “Gadgets” that users could choose and place anywhere on their iGoogle page. There were Gadgets for reading email, the news, or the weather forecast, and there was even a Gadget to convert between measurement units (meters to feet, pints to liters, etc.). Our team worked on the system that displayed Gadgets to users.

One day, we got a bug report claiming that, in Taiwan, the unit conversion Gadget didn’t appear in the traditional Chinese characters used in Taiwan but in the simplified Chinese characters used in mainland China.

In those days, Google and China weren’t going through the best moment in their relationship: some months before, Google had uncovered a series of hacking attacks from China and announced that it would close its specially censored search engine for China. From that moment on, Chinese users would access the regular uncensored search engine. China was not happy about this, and Google was not pleased either.

With so much tension between Google and China, learning that users in Taiwan had started to see a Gadget as if they were in mainland China added the Gadget Server team to the list of unhappy people. Might someone in China be intercepting Google traffic destined for Taiwan? We hoped not, and even though we didn’t expect it to be the case, we needed to find out what was happening.

The first step when you get a bug report is to try to reproduce it: you can’t investigate a bug you can’t see. However, no matter how I tried, I couldn’t reproduce it. I would send requests to see the Gadget as a Taiwanese user, but I would always see it in traditional Chinese characters, as if there were no bug at all.

Then, I sent the same request to every server at the same time and checked if there were any differences between them. Our service got requests from all over the world, so we had servers spread throughout the planet; user requests would arrive automatically at their nearest server. When I did my tests, my requests went to a US server, but requests from Taiwan would go to a server in Asia. In theory, all servers were identical, but what if it turned out they weren’t?

I wrote a web page that sent requests directly to every server, loaded it in my browser, and saw that some servers gave different answers. Most servers in Europe and America responded with the Gadget in traditional Chinese characters, which was the correct result; however, most servers in Asia responded in simplified Chinese.

To add to the mystery, not all servers in each location gave the same result. Some gave the correct answer, others gave the wrong answer, and the ratio between one and the other differed depending on the location.

After much testing, I realized there was an apparent memory effect. I would send a request to display the Gadget in simplified Chinese; then, for the next few minutes, the servers would always respond in simplified Chinese whenever I sent requests for traditional Chinese. It also happened the other way around: I would send a request for traditional Chinese, and the servers would keep responding to simplified Chinese requests in traditional Chinese.

This phenomenon explained why most servers in Asia responded in simplified Chinese: most Chinese-speaking users live in China and use simplified characters. Most requests coming from China went to servers in Asia, so they mostly received simplified Chinese requests. Then, the servers would get “stuck” in simplified Chinese for a few minutes; whenever they got a traditional Chinese request, they would give a simplified Chinese response.

After figuring this out, I felt huge relief, since it meant that the problem hadn’t been caused by nation-state-level traffic interception. Instead, it was just a regular, run-of-the-mill programming error. Nevertheless, it still needed to be fixed, and the symptoms suggested a caching problem.

Developers used XML files to define gadgets. Gadgets could also have translations into several languages, which developers stored in separate XML files (one file for each language), and the Gadget definition file had a list that told which translation file goes with which language.

Every time someone wanted to see a Gadget, the server had to download its XML definition file, parse it, download the required translation file, and parse it, too. Some Gadgets had millions of users, so the server would need to download and parse the same files over and over again. To avoid that, the Gadget Server had a cache.

A cache is a data structure that stores the result of an operation to avoid having to perform that operation repeatedly. In the case of the Gadget Server, the cache stored previously downloaded and parsed XML files. Whenever the server needed a file, it checked if it was already stored there and ready for use. Otherwise, the server would download and parse the file and cache the result to avoid doing that work again later.
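
To make the idea concrete, here is a minimal sketch of such a cache in Java. It is not the Gadget Server's code: the class name, the loader function, and the use of ConcurrentHashMap are assumptions made for illustration, and the sketch leaves out the expiry that a real cache would need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: remembers the result of an expensive load, keyed by URL. */
class ParsedFileCache<V> {
    private final Map<String, V> entries = new ConcurrentHashMap<>();
    // Stands in for "download the file and parse it" (an assumption for this sketch).
    private final Function<String, V> loader;

    ParsedFileCache(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value for the URL, running the loader only on a miss. */
    V get(String url) {
        return entries.computeIfAbsent(url, loader);
    }
}
```

Asking for the same URL twice runs the loader once; the second call is answered from the map.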

My initial theory was that, somehow, the cache could be mixing up the traditional and simplified Chinese translation files. I spent several days inspecting the code and the cached contents, but I couldn’t see any problem. As far as I could tell, the XML file cache’s implementation was correct and worked flawlessly. I would have sworn that the Gadget couldn’t be displayed in the wrong language if I hadn’t seen it myself.

While I inspected the code, I also tried reproducing the problem on my workstation. Production servers would get “stuck” on simplified or traditional Chinese for a few minutes; however, this never happened when I ran the server on my workstation: I would send mixed requests and get mixed responses. Therefore, I couldn’t reproduce the bug in a controlled environment.

That’s why I made a drastic decision: I would attach a debugger to a server in the production network and reproduce the bug there.

Of course, I wouldn’t do it on a production server. At the time, we had several types of servers, not just the production servers that received regular user requests. We also had sandbox servers with no external users; they were there so that iGoogle and other Gadget-using services could run tests without affecting users. I wouldn’t attach a debugger to a production server and risk affecting external users; I would do it on a sandbox server.

I chose a sandbox server, prepared it, attached a debugger to it, reproduced the bug, investigated it, and, finally, cleaned up and left everything the way it was before. After my investigation, I confirmed that, just as I’d thought, it was a caching problem, but not the caching problem I had expected.

According to my theory, the program would go to the cache to get the file with the traditional Chinese translation, and it would return the wrong file. I wanted to set a breakpoint before the XML file request and see what happened. Surprisingly, the caching system worked correctly: the program requested the traditional Chinese translation file and that’s what the cache provided. The problem had to be somewhere else.

After getting the translation, the program applied it to the Gadget. In translated Gadgets, the definition file didn’t include any text in any language; instead, it had placeholders that the server would replace with text from the translation file. That’s what happened: the server took the XML definition file, looked for the placeholders, and wherever one appeared, replaced it with the corresponding text in traditional Chinese script.
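
The substitution step could look roughly like the sketch below. The __MSG_name__ placeholder syntax, the class name, and the message map are assumptions for illustration; the real server's code is not shown in this story. The Chinese text is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of replacing placeholders with translated text. */
class PlaceholderSubstitution {
    // Assumed placeholder syntax for this sketch: __MSG_name__
    private static final Pattern PLACEHOLDER = Pattern.compile("__MSG_(\\w+?)__");

    /** Replaces every known placeholder; unknown placeholders are left untouched. */
    static String apply(String definitionXml, Map<String, String> messages) {
        Matcher m = PLACEHOLDER.matcher(definitionXml);
        StringBuilder out = new StringBuilder(); // StringBuilder overloads need Java 9+
        while (m.find()) {
            String text = messages.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String definition = "<ModulePrefs title=\"__MSG_title__\"/>";
        // "unit" in traditional Chinese script (U+55AE U+4F4D).
        System.out.println(apply(definition, Map.of("title", "\u55ae\u4f4d")));
    }
}
```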

The next step was to parse the resulting XML file.

The unit conversion Gadget had many users, many of whom used the traditional Chinese translation. To serve them, after replacing the placeholders with Chinese text, the server would have to parse the same resulting XML file over and over again. Since the server would have to parse the same XML several times a day, it used a cache to avoid doing all that redundant work. And I had no idea that this cache existed!

That was the cache that gave the wrong result: it would get an XML file with traditional Chinese text and return the result of parsing the same XML file but with simplified Chinese text.

Now, I needed to figure out why that happened.

Caches work by associating a key with a value. For example, the first cache I talked about in this story, which avoided downloading and parsing the same XML files repeatedly, used the file’s URL as the key and the parsed file as the value.

This new cache used a byte array representation of the XML file as its key. The server called the String.getBytes() method to obtain this representation; when called without arguments, this method converts a text string to a byte array using the platform’s default encoding.

On my workstation, the default encoding was UTF-8. This encoding converts most Chinese characters into three bytes each (some rarer characters take four). For example, UTF-8 represents the string “你好” as the bytes {0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd}.

On the servers, however, the default encoding was US-ASCII, a very old encoding (1963) that only supports the characters used in the English language, so it can’t encode Chinese characters. Whenever getBytes() finds a character it can’t encode, it replaces it with a question mark. Therefore, the string “你好” is encoded as “??”.
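
The difference is easy to see in a small Java program. This is only an illustration: it passes each charset explicitly to imitate the two environments, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

/** Illustration: the same string encoded under the two charsets from the story. */
class EncodingDemo {
    // "ni hao" (U+4F60 U+597D), written with escapes so the source encoding is irrelevant.
    static final String HELLO = "\u4f60\u597d";

    static byte[] asUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] asAscii(String s) {
        // Characters US-ASCII cannot represent come out as the byte for '?'.
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8(HELLO).length);  // prints "6"
        System.out.println(asAscii(HELLO).length); // prints "2"
        System.out.println(new String(asAscii(HELLO), StandardCharsets.US_ASCII)); // prints "??"
    }
}
```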

That’s where the problem was: when the server, which used US-ASCII, generated a key, it would consist of an XML file with every Chinese character replaced with a question mark. Since traditional and simplified Chinese translations used the same number of characters, even if the characters were different, the keys always turned out identical, so the server would use the value in the cache, even if it happened to be for the wrong Chinese script.
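
The collision can be reproduced in a few lines. The sketch below is a reconstruction under assumptions, not the server's code: ByteBuffer stands in for whatever content-compared key type the real cache used, the XML snippets are invented, and the charset is passed explicitly to imitate the server's default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Reconstruction of the key collision; all names here are hypothetical. */
class KeyCollisionDemo {
    // "unit" in traditional (U+55AE U+4F4D) and simplified (U+5355 U+4F4D) script.
    static final String TRADITIONAL = "<title>\u55ae\u4f4d</title>";
    static final String SIMPLIFIED = "<title>\u5355\u4f4d</title>";

    // Assumption: ByteBuffer is used because it compares by content, unlike byte[].
    static ByteBuffer key(String xml, Charset charset) {
        return ByteBuffer.wrap(xml.getBytes(charset));
    }

    /** Caches the simplified entry, then looks up the traditional text. */
    static String lookupTraditionalAfterSimplified(Charset charset) {
        Map<ByteBuffer, String> cache = new HashMap<>();
        cache.put(key(SIMPLIFIED, charset), "parsed simplified");
        return cache.getOrDefault(key(TRADITIONAL, charset), "miss");
    }

    public static void main(String[] args) {
        // Both keys collapse to the bytes of "<title>??</title>", so the wrong entry is a hit.
        System.out.println(lookupTraditionalAfterSimplified(StandardCharsets.US_ASCII)); // prints "parsed simplified"
    }
}
```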

This problem wasn’t reproducible on my workstation since it used UTF-8, which supports Chinese characters. Therefore, the keys differed, and the cache would return the correct value.

After several weeks of trying this and that, inspecting the code, fighting the cache, and, finally, taking desperate measures, the solution for this bug was to fix all calls to getBytes() so they’d use the UTF-8 encoding explicitly.
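
In code, the change amounts to naming the charset at every call site. The helper below is a hypothetical illustration of that pattern rather than the actual patch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper showing the explicit-charset pattern. */
class ExplicitEncodingKey {
    // Before (depends on the machine's default charset): xml.getBytes()
    // After: the charset is named in the code, so every machine produces the same bytes.
    static byte[] keyBytes(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] traditional = keyBytes("\u55ae\u4f4d"); // "unit", traditional script
        byte[] simplified = keyBytes("\u5355\u4f4d");  // "unit", simplified script
        System.out.println(Arrays.equals(traditional, simplified)); // prints "false"
    }
}
```

With distinct bytes for distinct text, the two scripts can no longer share a cache key.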

This story started as an international spy plot and ended with changing a function call to specify the encoding. I guess it’s a bit of an anticlimactic ending. Still, at least all the team members were happy not to have to testify before the US Congress.

This episode taught me the importance of always specifying the parameters the program depends on and never leaving anything implicit or dependent on the environment. Our server had a bug because it relied on a configuration parameter, the default encoding, that differed between our workstations and production; if we had specified the UTF-8 encoding explicitly from the beginning, this would never have happened to us.

And I wouldn’t have an interesting story to tell.