Friday, 1 May 2009

Java: a rough guide to character encoding

It can be tricky figuring out the difference between character handling code that works and code that just appears to work because testing did not encounter cases that exposed bugs. This is a post about some of the pitfalls of character handling in Java.


I wrote a little bit about Unicode before.

This post might be exhausting, but it isn't exhaustive.

Unicode in source files

Java source files include support for Unicode. There are two common mechanisms for writing code that includes a range of Unicode characters.

One choice is to encode the source files as Unicode, write the characters and inform the compiler at compile time. javac provides the -encoding <encoding> option for this. The downside is that anyone who uses the source file needs to be aware of this. Otherwise, attempting to compile the file on a system that uses a different default encoding will produce unpredictable results. The bad thing is that it might appear to work.

Code saved as UTF-8, as might be written on an Ubuntu machine:

public class PrintCopyright {
  public static void main(String[] args) {
    System.out.println("© Acme, Inc.");
  }
}

1. Compiling the code using the correct encoding:

javac -encoding UTF-8 PrintCopyright.java

2. Simulating a straight compile on Western-locale Windows:

javac -encoding Cp1252 PrintCopyright.java

These compiler settings will produce different outputs; only the first one is correct.

Note: the JDK 1.6 javac compiler will not compile a UTF-8 source file starting with a byte order mark, failing with the error illegal character: \65279. Byte order marks are discussed further down the page.

The other way to handle Unicode in source files is to only use characters that are encoded with the same values in many character encodings and replace other characters with Unicode escape sequences. In the amended source, the copyright sign © is replaced with the escape sequence \u00A9:

public class PrintCopyright {
  public static void main(String[] args) {
    System.out.println("\u00A9 Acme, Inc.");
  }
}

The amended source, saved as UTF-8, will produce the same output whether processed as UTF-8 or Cp1252. The characters used have the same binary values in both encodings.
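
If escaping characters by hand is tedious, the JDK ships a native2ascii tool that can perform the conversion. A sketch, assuming the source file is UTF-8 and the output file name is your own choice:

native2ascii -encoding UTF-8 PrintCopyright.java PrintCopyright.escaped.java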

These escape sequences apply to the whole file, not just String/char literals. This can produce some funny edge cases. Try compiling this code:

public class WillNotCompile {
  // the path c:\udir is interpreted as an escape sequence
  // the escape \u000A is replaced with a new line
}

Unicode support is detailed in the Lexical Structure chapter of the Java Language Specification.

Unicode and Java data types

Before tackling the encoding API, it is a good idea to get a handle on how text is represented in Java strings.

| Grapheme | Unicode Character Name(s) | Unicode Code Point(s) | Java char Literals | UTF-16BE (encoded bytes) | UTF-8 (encoded bytes) |
|----------|---------------------------|-----------------------|--------------------|--------------------------|-----------------------|
| A | LATIN_CAPITAL_LETTER_A | U+0041 | '\u0041' | 0041 | 41 |
| © | COPYRIGHT_SIGN | U+00A9 | '\u00A9' | 00A9 | C2A9 |
| é | LATIN_SMALL_LETTER_E_WITH_ACUTE | U+00E9 | '\u00E9' | 00E9 | C3A9 |
| é | LATIN_SMALL_LETTER_E + COMBINING_ACUTE_ACCENT | U+0065 U+0301 | '\u0065' '\u0301' | 0065 0301 | 65 CC81 |
| क्तु | DEVANAGARI_LETTER_KA + DEVANAGARI_SIGN_VIRAMA + DEVANAGARI_LETTER_TA + DEVANAGARI_VOWEL_SIGN_U | U+0915 U+094D U+0924 U+0941 | '\u0915' '\u094D' '\u0924' '\u0941' | 0915 094D 0924 0941 | E0A495 E0A58D E0A4A4 E0A581 |
| 𝔊 | MATHEMATICAL_FRAKTUR_CAPITAL_G | U+1D50A | '\uD835' '\uDD0A' | D835 DD0A | F09D948A |

All values are hexadecimal.
Some of the graphemes might not render (though they all seem to work in Firefox).
Table 1

The table above shows some of the things we have to look out for.

1. Stored characters can take up an inconsistent number of bytes. A UTF-8 encoded character might take between one (LATIN_CAPITAL_LETTER_A) and four (MATHEMATICAL_FRAKTUR_CAPITAL_G) bytes. Variable width encoding has implications for reading into and decoding from byte arrays.

2. Characters can be represented in multiple forms. As can be seen with e-acute (é), sometimes there is more than one way to store and render a grapheme. Sometimes they can be formed using combining sequences (as in the e-acute example); sometimes there are similar characters (Greek mu μ vs Mathematical micro µ). This may have implications for sorting, expression matching and capitalisation. It might raise compatibility issues for translating data between encodings. You can normalise strings using the Normalizer class, but be aware of any gotchas.

3. Not all code points can be stored in a char. The MATHEMATICAL_FRAKTUR_CAPITAL_G example lies in the supplementary range of characters and cannot be stored in 16 bits. It must be represented by two sequential char values, neither of which is meaningful by itself. The Character class provides methods for working with 32-bit code points.

    // Unicode code point to char array
    char[] math_fraktur_cap_g = Character.toChars(0x1D50A);

4. The relationship between the grapheme visible to the user and the underlying code points may not be 1:1. This can be seen in the combining character sequences (the e-acute example). As the Devanagari example shows, combining sequences can get quite complex (see the sketch below).
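
To illustrate points 3 and 4, here is a minimal sketch that walks a string by code point rather than by char; the string value is just an example:

    // "A" followed by MATHEMATICAL FRAKTUR CAPITAL G (a surrogate pair)
    String text = "A\uD835\uDD0A";
    for (int i = 0; i < text.length();) {
      int codePoint = text.codePointAt(i);
      System.out.format("U+%04X supplementary=%b%n", codePoint,
          Character.isSupplementaryCodePoint(codePoint));
      // advance by one or two char values depending on the code point
      i += Character.charCount(codePoint);
    }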

How long is a (piece of) String?

So, if a "character" can span multiple char values, we need to know how that affects our ability to calculate string lengths.

The char count is likely to be most useful for low level data operations; the grapheme count is the one you would use to do your own line wrapping in a GUI. You can find out if a particular font can render your string using the Font class.
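
For example, a rough sketch using java.awt.Font (the font name here is an arbitrary choice):

    // canDisplayUpTo returns -1 if the font can display the whole string,
    // otherwise the index of the first character it cannot display
    Font font = new Font("Serif", Font.PLAIN, 12);
    String text = "\u00A9 Acme, Inc.";
    int firstUndisplayable = font.canDisplayUpTo(text);
    System.out.println(firstUndisplayable == -1
        ? "font can render the whole string"
        : "cannot render the character at index " + firstUndisplayable);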

Code to get the various lengths:

  /**
   * @param type
   *          the type of string length
   * @param text
   *          non-null text to inspect
   * @return the length of the string according to the criteria
   */
  public static int getLength(CountType type, String text) {
    switch (type) {
    case CHARS:
      return text.length();
    case CODEPOINTS:
      return text.codePointCount(0, text.length());
    case GRAPHEMES:
      int graphemeCount = 0;
      BreakIterator graphemeCounter = BreakIterator
          .getCharacterInstance();
      graphemeCounter.setText(text);
      while (graphemeCounter.next() != BreakIterator.DONE) {
        graphemeCount++;
      }
      return graphemeCount;
    default:
      throw new IllegalArgumentException("" + type);
    }
  }

  enum CountType {
    CHARS, CODEPOINTS, GRAPHEMES
  }
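
A quick usage sketch (assuming the method and enum above are in scope); the counts match the MATHEMATICAL FRAKTUR CAPITAL G row of Table 2 below:

    String g = "\uD835\uDD0A";
    System.out.println(getLength(CountType.CHARS, g));      // 2
    System.out.println(getLength(CountType.CODEPOINTS, g)); // 1
    System.out.println(getLength(CountType.GRAPHEMES, g));  // 1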

Sample strings and their lengths by the various criteria:

| Unicode code points | graphemes | char count | code point count | grapheme count |
|---------------------|-----------|------------|------------------|----------------|
| U+0041 | A | 1 | 1 | 1 |
| U+00E9 | é | 1 | 1 | 1 |
| U+0065 U+0301 | é | 2 | 2 | 1 |
| U+0915 U+094D U+0924 U+0941 | क्तु | 4 | 4 | 1 |
| U+1D50A | 𝔊 | 2 | 1 | 1 |
| all of the above | Aééक्तु𝔊 | 10 | 9 | 5 |

Table 2

Encodings

Character encodings map characters to byte representations. The Unicode character set is mapped to bytes using Unicode transformation formats (UTF-8, UTF-16, UTF-32, etc.). Most encodings can represent only a subset of the characters supported by Unicode. Java strings use UTF-16.

Code that prints the byte representation of the pound sign (£) in different encodings:

    String[] encodings = { "Cp1252", // Windows-1252
        "UTF-8", // Unicode UTF-8
        "UTF-16BE" // Unicode UTF-16, big endian
    };

    String poundSign = "\u00A3";
    for (String encoding : encodings) {
      System.out.format("%10s%3s  ", encoding, poundSign);
      byte[] encoded = poundSign.getBytes(encoding);
      for (byte b : encoded) {
        System.out.format("%02x ", b);
      }
      System.out.println();
    }

Output of the code:

    Cp1252  £  a3
     UTF-8  £  c2 a3
  UTF-16BE  £  00 a3

Java uses two mechanisms to represent supported encodings. The initial mechanism was via string IDs. Java 1.4 introduced the type-safe Charset class. Note that the two mechanisms use different canonical names to represent the same encodings.

Java 6 implementations are only required to support six encodings (US-ASCII; ISO-8859-1; UTF-8; UTF-16BE; UTF-16LE; UTF-16). In practice, they tend to support many more.

Other JREs (e.g. IBM; Oracle JRockit) may support different sets of encodings.
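
To see exactly which encodings a given JRE supports, something like the following will list them (a minimal sketch; it uses java.nio.charset.Charset and java.util.Map):

    // print every supported charset together with its aliases
    for (Map.Entry<String, Charset> entry : Charset.availableCharsets()
        .entrySet()) {
      System.out.println(entry.getKey() + " " + entry.getValue().aliases());
    }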

Every platform has a default encoding:

    Charset charset = Charset.defaultCharset();
    System.out.println("Default encoding: " + charset + " (Aliases: "
        + charset.aliases() ")");

If a standard Java API converts between byte and char data, there is a high probability that it is using the default encoding. Examples of such calls are: String(byte[]); String.getBytes(); InputStreamReader(InputStream); OutputStreamWriter(OutputStream); anything in FileReader or FileWriter.

The problem with these calls is that you cannot predict whether data written on one machine will be read correctly on another, even by the same application. An English Windows PC will use Windows-1252; a Russian Windows PC will use Windows-1251; an Ubuntu machine will use UTF-8. Prefer encoding methods like OutputStreamWriter(OutputStream, Charset) that let you set the encoding explicitly. Fall back on the default encoding only when you have no choice.
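
A small sketch of the difference, reusing the pound sign from the earlier example:

    String poundSign = "\u00A3";
    // relies on the platform default encoding; the bytes vary between machines
    byte[] ambiguous = poundSign.getBytes();
    // explicit encoding; the same bytes are produced on every machine
    byte[] portable = poundSign.getBytes(Charset.forName("UTF-8"));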

Portability is not the only problem with relying on the default encoding. As will be shown, using the old 8-bit encodings that many systems use for compatibility reasons can also result in data loss.

Encoding operations transform data. Do not try to pass "binary" data through character methods. A Java char is not a C char.

UPDATE (Dec 2009): A quick note on the file.encoding system property, which you may see mentioned in Java encoding discussions. Do not set or rely on this property. Sun says:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

Stream encoding

This code reliably writes and then reads character data using a file:

  public static void writeToFile(File file,
      Charset charset, String data) throws IOException {
    OutputStream out = new FileOutputStream(file);
    Closeable stream = out;
    try {
      Writer writer = new OutputStreamWriter(out, charset);
      stream = writer;
      writer.write(data);
    } finally {
      stream.close();
    }
  }

  public static String readFromFile(File file,
      Charset charset) throws IOException {
    InputStream in = new FileInputStream(file);
    Closeable stream = in;
    try {
      Reader reader = new InputStreamReader(in, charset);
      stream = reader;
      StringBuilder inputBuilder = new StringBuilder();
      char[] buffer = new char[1024];
      while (true) {
        int readCount = reader.read(buffer);
        if (readCount < 0) {
          break;
        }
        inputBuilder.append(buffer, 0, readCount);
      }
      return inputBuilder.toString();
    } finally {
      stream.close();
    }
  }

  public static void main(String[] args) {
    String mathematical_fraktur_capital_g = "\uD835\uDD0A";
    // use a Unicode encoding
    Charset utf8 = Charset.forName("UTF-8");
    File file = new File("test.txt");
    // write the file
    try {
      writeToFile(file, utf8,
          mathematical_fraktur_capital_g);
      String input = readFromFile(file, utf8);
      if (input.equals(mathematical_fraktur_capital_g)) {
        System.out.println("OK");
      } else {
        System.err.println("DATA LOSS ERROR");
      }
    } catch (IOException e) {
      System.err.println("IO ERROR");
      e.printStackTrace();
    }
  }

The same encoding is used for both operations, so they are symmetrical. A Unicode encoding is used, so no data loss will occur.

One potential problem area is wrapping a stream with a writer when data has already been written to the stream (or when appending data to a file). Be careful of byte order marks. This is the same problem as the one discussed below for the string encoding calls.

Lossy conversions

Saving to the operating system's default encoding can be a lossy process. This code simulates what happens when the mathematical expression x ≠ y is saved and loaded on an English Windows system using its default file encoding:

    // \u2260 NOT EQUAL TO
    String mathematicalExpression = "x \u2260 y";
    // English Windows charset
    Charset windows1252 = Charset.forName("windows-1252");
    // Simulate save/load
    byte[] encoded = mathematicalExpression
        .getBytes(windows1252);
    String decoded = new String(encoded, windows1252);
    // exception will be thrown:
    if (!mathematicalExpression.equals(decoded)) {
      throw new Exception("Information was lost");
    }

The symbol NOT_EQUAL_TO (≠) is replaced with a question mark during encoding. The Windows-1252 character set does not include that character.

Use non-Unicode encodings reluctantly. Prefer self-describing file formats that support Unicode (like XML [spec]) or formats that mandate Unicode (like JSON [spec]).

Note: normalisation and encoding are distinct operations; the combining sequence U+0065 U+0301 (é) will, using windows-1252, be encoded to the bytes 65 3F (e?). The API will not normalise text automatically.
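
A sketch of normalising before encoding (whether to normalise is your call, and it only helps where a precomposed equivalent exists in the target encoding; uses java.text.Normalizer):

    Charset windows1252 = Charset.forName("windows-1252");
    // "cafe" with a combining acute accent on the final letter
    String combining = "caf\u0065\u0301";
    // encoding the combining form loses the accent: 63 61 66 65 3f
    byte[] lossy = combining.getBytes(windows1252);
    // composing to NFC first maps U+0065 U+0301 to U+00E9,
    // which windows-1252 can encode: 63 61 66 e9
    byte[] intact = Normalizer.normalize(combining, Normalizer.Form.NFC)
        .getBytes(windows1252);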

Unicode byte order marks

Byte order marks (BOMs) are a sequence of bytes that are sometimes used to mark the start of the data with information about the encoding in use. Sometimes Unicode BOMs are mandatory; sometimes they must not be used. The Unicode BOM FAQ contains more details, including a section on how to deal with them.

This code generates the byte order marks for selected Unicode encodings:

    // Encode this to get byte order mark
    final String bomChar = "\uFEFF";
    // Unicode encodings
    String[] unicodeEncodings = { "UTF-8", "UTF-16BE", "UTF-16LE",
        "UTF-32BE", "UTF-32LE" };
    // Print the byte order marks
    for (String encName : unicodeEncodings) {
      Charset charset = Charset.forName(encName);
      byte[] byteOrderMark = bomChar.getBytes(charset);
      System.out.format("%10s BOM: ", charset.toString());
      for (byte b : byteOrderMark) {
        System.out.format("%02x ", b);
      }
      System.out.println();
    }

The generated output:

     UTF-8 BOM: ef bb bf 
  UTF-16BE BOM: fe ff 
  UTF-16LE BOM: ff fe 
  UTF-32BE BOM: 00 00 fe ff 
  UTF-32LE BOM: ff fe 00 00 

Some encodings will automatically emit byte order marks on encode and read them on decode. Select the encoding scheme appropriate for your use case.

This code inserts a BOM automatically:

    Charset charset = Charset.forName("UTF-16");
    byte[] encodedBytes = "hello".getBytes(charset);
    System.out.format("%10s encoded: ", charset.toString());
    for (byte b : encodedBytes) {
      System.out.format("%02x ", b);
    }
    System.out.println();

The output, prefixed with the fe ff big endian BOM:

    UTF-16 encoded: fe ff 00 68 00 65 00 6c 00 6c 00 6f 

Automatic detection of encoding

Automatic detection of encoding has been tried (notably in Windows), but there is no way to make it work with one hundred percent reliability for an arbitrary text file. Notepad detects the windows-1252 encoded string "this app can break" as CJK characters.

The International Components for Unicode project includes support for character set detection.

    byte[] thisAppCanBreak = "this app can break"
        .getBytes("ISO-8859-1");
    CharsetDetector detector = new CharsetDetector();
    detector.setText(thisAppCanBreak);
    String tableTemplate = "%10s %10s %8s%n";
    System.out.format(tableTemplate, "CONFIDENCE",
        "CHARSET""LANGUAGE");
    for (CharsetMatch match : detector.detectAll()) {
      System.out.format(tableTemplate, match
          .getConfidence(), match.getName(), match
          .getLanguage());
    }

Output:

CONFIDENCE    CHARSET LANGUAGE
        63 ISO-8859-2       hu
        47 ISO-8859-9       tr
        47 ISO-8859-1       nl
        47 ISO-8859-1       en
        31 ISO-8859-2       ro
        31 ISO-8859-1       fr
        15 ISO-8859-1       sv
        15 ISO-8859-1       pt
        15 ISO-8859-1       es
        15 ISO-8859-1       da
        10       Big5       zh
        10     EUC-KR       ko
        10     EUC-JP       ja
        10    GB18030       zh
        10  Shift_JIS       ja
        10      UTF-8     null

The chosen text will decode correctly if interpreted as ISO-8859-2 Hungarian, so it looks like we will have to try harder in the selection of data that will cause character corruption. As the ICU project notes about detection:

This is, at best, an imprecise operation using statistics and heuristics.

The potential pitfalls of encoding and decoding using the String class

The java.lang.String class can be used to encode and decode data, though be aware of problem areas.

This code demonstrates how attempting to use string methods to decode multibyte-encoded text from a stream can lead to character corruption:

  /** buggy decode method - do not use! */
  private static String decodeFromStream(
      InputStream stream, Charset encoding)
      throws IOException {
    StringBuilder builder = new StringBuilder();
    byte[] buffer = new byte[4];
    while (true) {
      // read bytes into buffer
      int r = stream.read(buffer);
      if (r < 0) {
        break;
      }
      // decode byte data into char data
      String data = new String(buffer, 0, r, encoding);
      builder.append(data);
    }
    return builder.toString();
  }

  public static void main(String[] args) throws IOException {
    Charset utf8 = Charset.forName("UTF-8");
    String original = "abc\u00A3";
    byte[] encoded = original.getBytes(utf8);
    // create mock file input stream
    InputStream stream = new ByteArrayInputStream(encoded);
    // decode the data
    String decoded = decodeFromStream(stream, utf8);

    if (!original.equals(decoded)) {
      throw new IllegalStateException("Lost data");
    }
  }

The flaw in this code is that characters in the stream can span reads into the byte buffer. The character U+00A3 (£) becomes corrupted as one half of it ends up in the tail end of the buffer in one pass and the other half ends up at the start of the buffer in the next pass. Neither value is meaningful by itself. Because a UTF-8 encoded character may be between one and four bytes in length, there is no buffer size that will always read whole characters. The entire data would need to be read into a larger byte buffer and decoded in one go. It is safer to use an InputStreamReader.
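
Where a Reader cannot be used, the lower-level java.nio.charset.CharsetDecoder API can carry partial characters across buffer boundaries. A rough sketch of the idea, not production code (error handling is omitted, buffers are kept deliberately tiny, and the method name is made up):

  /** decodes in small chunks without splitting multi-byte characters */
  private static String decodeFromStreamSafely(InputStream stream,
      Charset encoding) throws IOException {
    CharsetDecoder decoder = encoding.newDecoder();
    StringBuilder builder = new StringBuilder();
    ByteBuffer bytes = ByteBuffer.allocate(4);
    CharBuffer chars = CharBuffer.allocate(8);
    boolean endOfInput = false;
    while (!endOfInput) {
      // top up the byte buffer; undecoded bytes from the previous pass
      // are already at the front of the buffer
      int r = stream.read(bytes.array(), bytes.position(),
          bytes.remaining());
      if (r < 0) {
        endOfInput = true;
      } else {
        bytes.position(bytes.position() + r);
      }
      bytes.flip();
      // the decoder stops before an incomplete trailing byte sequence
      decoder.decode(bytes, chars, endOfInput);
      chars.flip();
      builder.append(chars);
      chars.clear();
      // keep any undecoded bytes for the next pass
      bytes.compact();
    }
    return builder.toString();
  }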

This code demonstrates how data corruption can occur when encoding text incrementally:

  /** buggy encoding method - do not use! */
  private static void writeToStream(OutputStream out,
      String text, Charset encoding) throws IOException {
    byte[] encoded = text.getBytes(encoding);
    out.write(encoded);
  }

  public static void main(String[] args) throws IOException {
    Charset utf16 = Charset.forName("UTF-16");
    String[] data = { "abc", "def" };
    String concatenated = "";
    // buffer standing in for a file
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    for (String segment : data) {
      concatenated += segment;
      // encode and write the segment to a simulated file
      writeToStream(buffer, segment, utf16);
    }

    byte[] encoded = buffer.toByteArray();
    String decoded = new String(encoded, utf16);

    if (!concatenated.equals(decoded)) {
      throw new IllegalStateException("Data corrupted");
    }
  }

The problem here is that the UTF-16 encoding scheme used adds a byte order mark to encoded data. Every time a string is written, another BOM is added, littering the content with unwanted data. When the data is decoded, extra characters end up in the text. All text would need to be concatenated and encoded in one go. It is better to use an OutputStreamWriter.
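
A sketch of the safer approach, wrapping the stream in a single OutputStreamWriter so that one encoder instance handles every segment:

    Charset utf16 = Charset.forName("UTF-16");
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(buffer, utf16);
    for (String segment : new String[] { "abc", "def" }) {
      // the writer encodes incrementally; only one BOM is emitted
      writer.write(segment);
    }
    writer.close();
    String decoded = new String(buffer.toByteArray(), utf16);
    System.out.println("abcdef".equals(decoded)); // true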

There are encodings for which these operations are safe, but you would be tying your method implementations to specific encodings rather than making them general. Use string class encode/decode methods only on whole data.

Notes

System.out may not always print the results you expect. I posted about this in terms of the Windows console, but Windows is not the only system with this sort of behaviour.

If you feel you are on rocky ground with character encodings, Bob Balaban has a good presentation: Multi-Language Character Sets: What They Are, How to Use Them (2001; PDF) [alt link]. He worked on character issues for more than a little while. Chances are that you are not a Lotus Notes developer, so you can skip some of the specifics.

The ICU4J library is a good place to turn if you're trying to find Unicode (and other I18N) features not included in the Java standard library.

Examples were developed using Java 6.

7 comments:

  1. Hi. Nice article. I've had trouble with, however, the process of reading (first) and writing to file on linux.
    I have no troubles doing this on windows. And don't understand why.
    The file I read, according to firefox, is of character encoding Windows-1250. Which as we know is the default on windows. Whereas UTF-8 is on linux. Which is the main difference, I would think.
    But I read in using the same encoding as I write out, like your example. It works on windows, and it doesn't on linux. That is, My umlauts in the text get turned into garbage.
    Any idea what I need to look at to make this work on linux?

  2. Hi Sean, I can't tell what the problem is based on your description. Diagnosing character encoding issues can be tricky. I would post the problem on a Q&A site like stackoverflow.com. Pare your code down to the minimum required to reproduce the problem and post the hexadecimal form of the text files your application reads/produces (you can use xxd on Linux produce the hex dump).

  3. Would you, by any chance, have an idea on how to make an approximate conversion between encodings?

    Like, avoiding '?' when a character is unknown.

    For example, the character ... in unicode, can be translated as 3 . in ascii.

    Thanks in advance

  4. Hi Simon. The Unicode FAQ on Conversions/Mappings doesn't offer much in the way of general advice that will help you.

    Assuming the ellipsis character is U+2026 (…) then using java.text.Normalizer.Form.NFKD will turn it into three periods (U+002E). However, be aware that normalizing a string this way may transform other characters in ways you don't intend.

  5. Indeed a great post. character encoding is very important when converting byte array to String. I have also blogged my experience as How to find default character encoding or charset in java to point out issues with file.encoding. let me know how do you find it.

  6. Excellent! Could you please expand in section [How long is a (piece of) String?](http://illegalargumentexception.blogspot.gr/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_strnlen) as to why for instance U+1D50A is 2 chars (and one code point), and similarly for the grapheme that's 4 code points. Also in the same section, instead of a switch you could use an enum method - return enumInstance.getLength(str) - the switch just blurs the picture I think

    1. Code points are unique writing system constructs identified and supported by the Unicode consortium. Consider accented characters like é. In European, Latin-based languages the number of accents (diacritics) available on a letter is usually one at a time. Other writing systems are more complex. So Unicode must allow the building of visible graphemes using more than one diacritic - (e.g. the letter, followed by the accent). There are too many combinations to represent every form. This leads to the situation where there is more than one way to describe é in code points.

      Code points are an abstraction: define a writing system element; give it a number.

      Everything must be represented in memory somehow in a computer system. These use cases allocate units of memory (8/16/32 bits) - code units to represent code points - and are called Unicode Transformation Formats. In UTF-8, a code point is 1, 2, 3 or 4 code units. In UTF-16 a code point is 1 or 2 code units. In UTF-32 a code point is a code unit.

      Java strings are UTF-16 so there can be a difference between the number of code units and the number of code points.

      I encourage reading the latest Unicode specification for more information.

      I will leave the code as it is - I hope it describes how to measure the differences clearly enough.

