I was asked to track down some issues today with Japanese characters getting garbled as they flowed through our system. I tracked the problem down to some code that was building a zip file to be sent across the wire. I grabbed the zip in transit and cracked it open to look at the files. I opened one of the files in a hex editor (Notepad++ with the HEX-Editor plugin) and looked at the particular characters in question. They were all 0x3F and displayed as ?. The fact that these were all the same indicated that the characters were already garbled.
The file in question was actually an XML fragment which was built in the code using JDOM. The code was converting the JDOM objects into a string then into bytes which were written to the ZipEntry. I backtracked a bit and looked into what was happening in the XML to string conversion. The code was using XMLOutputter with Format.getRawFormat(). This method returns a Format that sets a bunch of common options and uses UTF-8. So, in the debugger I checked that the string coming out of XMLOutputter looked valid, and indeed it did.
At this point, I had good XML input and bad file output, so I had the problem cornered and there was really only one transformation in the middle. The code that was converting from string to bytes was using String.getBytes(), which the javadoc states uses the default character encoding on your system. You can determine your default encoding with Charset.defaultCharset().displayName().
On my machine, the default charset is windows-1252, although it could of course be anything, which makes it a bad thing to rely on. Windows-1252 is an 8-bit (alarms should go off) Windows code page, which is a superset of ISO-8859-1, which is basically the Latin characters. So, it handles all your common Latin languages with accents, umlauts, and all that good stuff. However, it is completely broken for any characters outside that (relatively) small set of codepoints, such as my Japanese characters, which were all being replaced with ‘?’.
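Here's a quick sketch of the failure mode (the Japanese string is just an illustrative stand-in, not the actual data from the bug): encoding characters that windows-1252 can't represent silently substitutes '?', which is exactly the 0x3F I saw in the hex editor.

```java
import java.nio.charset.Charset;

public class LossyEncodingDemo {
    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // "日本語"

        // Simulate String.getBytes() on a machine whose default charset
        // is windows-1252 -- unmappable characters are silently replaced
        byte[] bytes = japanese.getBytes(Charset.forName("windows-1252"));

        for (byte b : bytes) {
            System.out.printf("%02X ", b); // prints "3F 3F 3F" -- every char became '?'
        }
        System.out.println();
    }
}
```

Note that `getBytes` never throws for unmappable characters; it just replaces them, which is why this kind of bug shows up far downstream instead of at the point of conversion.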
A much better idea is to specify the encoding explicitly, and it made sense to match the encoding already being used by JDOM’s XMLOutputter, so I used UTF-8: [source:java]
String data = …xml string data…
Charset cs = Charset.forName("UTF-8");
ByteBuffer bytes = cs.encode(data);
// note: array() returns the backing array, which may be longer than the
// encoded data -- bytes.remaining() gives the true encoded length
byte[] encodedBytes = bytes.array();
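For what it's worth, the same conversion can be done in one call with the `String.getBytes(String charsetName)` overload (it declares UnsupportedEncodingException, which can't actually happen for UTF-8, since every JVM is required to support it). A minimal sketch, again with a placeholder string:

```java
import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String data = "\u65e5\u672c\u8a9e"; // placeholder for the real XML string

        // Equivalent to the Charset/ByteBuffer approach, in one call
        byte[] encodedBytes = data.getBytes("UTF-8");

        System.out.println(encodedBytes.length); // 9: each of these chars is 3 bytes in UTF-8
    }
}
```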
Of course, when I changed this code, the code on the server side started failing, as that code is presumably also using the default system encoding. So, the issue is not quite resolved yet, but I think we’re on the way to a solution.
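The fix on the receiving side is presumably the mirror image (a sketch; I haven't actually seen that code yet): decode the bytes with an explicit charset instead of the `new String(bytes)` constructor, which uses the platform default.

```java
import java.nio.charset.Charset;

public class DecodeWithCharset {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] wireBytes = "\u65e5\u672c\u8a9e".getBytes(utf8); // stand-in for bytes off the wire

        // new String(wireBytes) would decode with the platform default and
        // garble these; naming the charset keeps both ends in agreement
        String decoded = new String(wireBytes, utf8);

        System.out.println(decoded.equals("\u65e5\u672c\u8a9e")); // true
    }
}
```

The general rule is that the encoding is part of the wire format: both ends have to name it explicitly, or they're just gambling that their platform defaults happen to match.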
The lesson is that every developer these days should know at least a little about Unicode, codepoints, and character encodings. About once a year something like this comes up that makes me re-learn it all, but over time things get hazy again. Some helpful links that you might want to read to learn more:
- Joel Spolsky – The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – a must read
- Supplementary Characters in the Java Platform – details how supplementary Unicode characters are supported in Java, but also generally informative
- Java Internationalization home
- The Unicode Standard
A couple of other things I ran across while researching the bug or writing this post that may be useful: