This month it will be ten years since that time my team and I briefly thought we would be involved in an international diplomatic incident.

It was May 2010, and I had recently moved from the Google office in Dublin, Ireland, to the Mountain View, California, headquarters, to join the Gadget Server team.

One day we got a bug report claiming that, in Taiwan, the unit conversion Gadget wasn’t being shown in traditional Chinese characters, which are used in Taiwan, but in simplified Chinese characters, which are used in mainland China.

In those days, Google and China weren’t going through the best moment in their relationship: some months before, Google had uncovered a series of hacking attacks coming from China and announced that it would close the special censored search engine for China. From that moment on, Chinese users would access the regular uncensored search engine. China was not happy about this, and Google was not happy either.

With so much tension between Google and China, learning that users in Taiwan had started to see a Gadget as if they were in China added the Gadget Server team to the list of unhappy people. Might someone in China be intercepting Google traffic bound for Taiwan? We certainly hoped not, and even though we didn’t really expect it to be true, we needed to find out what was happening.

The first step when you get a bug report is to try to reproduce it: you can’t investigate a bug you can’t see. However, any way I tried it, I just couldn’t reproduce it. I would send requests to see the Gadget as a Taiwanese user, but I would always see it in traditional Chinese characters, as if there were no bug. I tried it every which way, with no results.

Then I thought of sending the same request to every server at the same time and checking if there were any differences between them. Our service got requests from all over the world, so we had servers spread throughout the planet; requests from users would be routed automatically to their nearest server. When I did my tests, my requests went to a server in the US, but requests coming from Taiwan would go to a server in Asia. In theory, all servers were identical… but what if it turned out they weren’t?

I wrote a web page that sent requests directly to every server, loaded it in my browser, and saw that some servers gave different answers. Most servers in Europe and America responded with the Gadget in traditional Chinese characters, which was the correct result; however, most servers in Asia responded in simplified Chinese.

To add to the mystery, not all servers in each location gave the same result: some gave the right answer and some gave the wrong answer, but the ratio between one and the other was different depending on the location.

After a lot of testing, I realized that there was some kind of memory effect. I would send a request to display the Gadget in simplified Chinese and then, for the next few minutes, the servers would always respond in simplified Chinese whenever I sent requests for traditional Chinese. It also happened the other way around: I would send a request for traditional Chinese, and the servers would keep responding in traditional Chinese to simplified Chinese requests.

This explained why most servers in Asia responded in simplified Chinese: most Chinese-speaking users live in China, so they use simplified characters. Most requests coming from China went to servers in Asia, so those servers mostly received simplified Chinese requests. The servers would then get “stuck” in simplified Chinese for a few minutes and, whenever they got a traditional Chinese request, they would give a simplified Chinese response.

I felt a huge relief after figuring this out, since it meant that the problem hadn’t been caused by nation-state-level traffic interception. Instead, it was just a regular, run-of-the-mill programming error. Nevertheless, it still needed to be fixed, and the symptoms suggested that it was caused by a caching problem.

Gadgets were defined in XML files. Gadgets could also be translated into several languages, so the text in each language was stored in another XML file. The Gadget definition file contained a list that told which translation file goes with which language.

A cache is a data structure that stores the result of an operation to avoid having to perform that operation repeatedly. In the Gadget Server, the cache stored previously downloaded and parsed XML files. Whenever the server needed a file, it first checked if it already was in the cache; if so, it could just use the file directly. Otherwise, the server would download and parse the file and store the result in the cache so it could be used again later.
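The “check the cache first, otherwise download, parse, and store” behavior can be sketched in a few lines of Java. This is only an illustration, not the actual Gadget Server code: the class name `ParsedFileCache` and the `downloadAndParse` function are made up for the example, and a real cache would also expire its entries after a while.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a "download and parse once" cache keyed by URL. */
final class ParsedFileCache<V> {
    private final Map<String, V> entries = new ConcurrentHashMap<>();
    // Assumed loader: downloads the file at the given URL and parses it.
    private final Function<String, V> downloadAndParse;

    ParsedFileCache(Function<String, V> downloadAndParse) {
        this.downloadAndParse = downloadAndParse;
    }

    /** Returns the cached value for the URL, running the loader only on a miss. */
    V get(String url) {
        return entries.computeIfAbsent(url, downloadAndParse);
    }
}
```

With this shape, asking twice for the same URL runs the expensive loader only once; the second call is answered from the map.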

My initial theory was that, somehow, the cache could be mixing up the traditional and simplified Chinese translation files. I spent several days inspecting the code and the contents of the cache, but I couldn’t see any problem. As far as I could tell, the XML file cache had been implemented correctly and worked perfectly. If I hadn’t seen it myself, I would have sworn that it was impossible for the Gadget to be displayed in the wrong language.

While I was inspecting the code, I also tried to reproduce the problem on my workstation. Production servers would get “stuck” on simplified or traditional Chinese for a few minutes, but this never happened when I ran the server on my workstation: I would send mixed requests and get mixed responses. Therefore, once again, I couldn’t reproduce the bug in a controlled environment.

That’s why I made a drastic decision: I would attach a debugger to a server in the production network and reproduce the bug there.

Of course, I wouldn’t do it on a server that handled real user traffic. At the time, we had several kinds of servers, not only production servers, which received requests from regular users. We also had sandbox servers that had no external users; instead, they were there so that iGoogle and other Gadget-using services could perform tests without affecting users. I wasn’t going to attach a debugger to a production server and risk affecting external users; I would do it on a sandbox server.

I chose a sandbox server, prepared it, attached a debugger to it, reproduced the bug, investigated it, and, finally, cleaned up and left everything the way it was before. After my investigation I confirmed that, just as I’d thought, it was a caching problem, but not the caching problem I had expected.

According to my theory, the program would go to the cache to get the file with the traditional Chinese translation, but the cache would come back with the wrong file. I wanted to set a breakpoint just before the XML file request and see what happened. To my surprise, the cache worked correctly: the program requested the traditional Chinese translation file and the cache provided the traditional Chinese translation. Obviously, the problem had to be somewhere else.

After getting the translation, the program applied it to the Gadget. In translated Gadgets, the definition file didn’t include any text in any language; instead, it included placeholders that the server would replace with text from the translation file. That’s exactly what happened: the server took the XML definition file, looked for the placeholders, and wherever one appeared, it was replaced with the corresponding text in traditional Chinese script.
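As a rough sketch of that substitution step, the Java below replaces placeholders in a definition string with text from a translation map. The `__MSG_name__` placeholder syntax, the class name, and the method name are assumptions made for the example, not something taken from the server’s code, and a real implementation would also need to escape the inserted text for XML.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: fill translation text into a Gadget definition. */
final class PlaceholderSubstitution {
    // Assumed placeholder syntax: __MSG_<name>__
    private static final Pattern PLACEHOLDER =
            Pattern.compile("__MSG_([A-Za-z0-9_]+?)__");

    /** Replaces every known placeholder; unknown ones are left untouched. */
    static String apply(String definitionXml, Map<String, String> messages) {
        Matcher m = PLACEHOLDER.matcher(definitionXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = messages.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as regex replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The output of this step is a complete XML document whose only language-specific part is the substituted text, which matters for what comes next.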

The next step was to parse the resulting XML file.

The unit conversion Gadget had many users, and a lot of them had it translated to traditional Chinese. This meant that, after replacing the placeholders with Chinese text, the server would have to parse the same resulting XML file over and over again. Since the server would have to parse the same XML several times a day, it used a cache to avoid having to do all that redundant work. And I had no idea that this cache existed!

That was the cache that gave the wrong result: the cache would get an XML file with traditional Chinese text and return the result of parsing the same XML file, but with simplified Chinese text.

Now I needed to figure out why that happened.

Caches work by associating a key with a value. For example, the first cache I talked about in this story, which avoided having to download and parse the same XML files repeatedly, used the file’s URL as the key and the parsed file as the value.

This new cache, which was used to avoid having to parse translated definition files repeatedly, used the XML file itself, represented as a byte array, as the key. To compute the key, the server called the String.getBytes() function, which converts a text string to a byte array using the default encoding.

On my workstation, the default encoding was UTF-8. This encoding converts each Chinese character into three bytes (or four, for some rare characters). For example, UTF-8 represents the string “你好” as the bytes {0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd}.

On the servers, however, the default encoding was US-ASCII. This is a very old encoding (1963) and only supports the characters used in the English language, so it can’t encode Chinese characters. Whenever getBytes() finds a character it can’t encode, it replaces it with a question mark. Therefore, the string “你好” is encoded as “??”.
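The difference between the two encodings is easy to see in a small Java program. This is a sketch written for this post (the class and helper names are made up); it passes the charset explicitly so that it behaves the same on any machine, and it writes “你好” with Unicode escapes so that the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: the same string encoded as UTF-8 and as US-ASCII. */
final class EncodingDemo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] ascii(String s) {
        // Characters that US-ASCII cannot represent become '?' (0x3f).
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hello = "\u4f60\u597d"; // the two characters of "ni hao"
        System.out.println(hex(utf8(hello)));  // prints "e4 bd a0 e5 a5 bd"
        System.out.println(hex(ascii(hello))); // prints "3f 3f", i.e. "??"
    }
}
```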

That’s where the problem was: when the server, which used US-ASCII, generated a key, it would consist of an XML file with every Chinese character replaced with a question mark. Since traditional and simplified Chinese translations used the same number of characters, even if the characters were different, the keys always turned out identical, so the server would use the value in the cache, even if it was for the wrong Chinese script.
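Here is a small Java sketch of that collision. It is not the server’s code: the cache is a plain `HashMap`, the key bytes are wrapped in a `ByteBuffer` (my assumption, chosen because it compares by content), and “parsing” is faked with a string. The two inputs are the same XML snippet containing “unit conversion” in simplified (单位换算) and traditional (單位換算) characters, written as Unicode escapes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a parse cache keyed by the encoded XML bytes. */
final class ParseCacheCollision {
    /** Looks up the "parsed" XML, computing and storing it on a miss. */
    static String lookUp(Map<ByteBuffer, String> cache, String xml, Charset charset) {
        // The key is the XML text converted to bytes with the given charset.
        ByteBuffer key = ByteBuffer.wrap(xml.getBytes(charset));
        return cache.computeIfAbsent(key, k -> "parsed(" + xml + ")");
    }

    /** Requests simplified first, then traditional; returns the second answer. */
    static String secondLookup(Charset charset) {
        String simplified = "<title>\u5355\u4f4d\u6362\u7b97</title>";
        String traditional = "<title>\u55ae\u4f4d\u63db\u7b97</title>";
        Map<ByteBuffer, String> cache = new HashMap<>();
        lookUp(cache, simplified, charset);
        return lookUp(cache, traditional, charset);
    }
}
```

Calling `secondLookup(StandardCharsets.US_ASCII)` returns the entry cached for the simplified text, because both keys encode to `<title>????</title>`. With `StandardCharsets.UTF_8` the keys differ and the traditional text gets its own entry.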

This problem wasn’t reproducible on my workstation since it used UTF-8, which supports Chinese characters. Therefore, the keys were different and the cache would return the correct value.

After several weeks trying this and that, inspecting the code, fighting the cache, and, finally, taking desperate measures, the solution for this bug was to fix all calls to getBytes() so they’d use the UTF-8 encoding explicitly.
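In code, the change looked roughly like the sketch below. The class and method names are invented for the example; only `String.getBytes()` and its overload that takes a charset come from the Java standard library.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the fix: name the encoding instead of inheriting it. */
final class CacheKeys {
    /** Before: the result depends on the default charset of the machine. */
    static byte[] environmentDependentKey(String xml) {
        return xml.getBytes();
    }

    /** After: the same bytes on every machine, whatever its configuration. */
    static byte[] explicitKey(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```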

This story started as an international spy plot and ended by changing a function to specify the encoding. I guess it’s a bit of an anticlimactic ending, but at least all the team members were happy not to have to testify before the US Congress or anything like that.

All the same, this episode taught me the importance of always specifying the parameters the program depends on, and never leaving anything implicit or dependent on the environment. Our server had a bug because it depended on a configuration parameter that was different on our workstations and in production; if we had specified the UTF-8 encoding explicitly from the beginning, this would never have happened to us.

And I wouldn’t have a cool story to tell.

You just can’t have everything.

This article has been translated into Spanish: “Mi bug más memorable”.