Java: effective Unicode

Saturday, 18 July 2009

This is my attempt at a list of maxims to abide by when working with text in Java, in the vein of Effective Java or The Ten Commandments of Unicode. It is also a summary of another post on character encoding. The list is in no way comprehensive.

1: Understand how UTF-8 differs from ASCII (and CP437, and Windows-1252, and ISO-8859-1, and ISO-8859-15, and MacRoman, etc.)

Understand how using the wrong encoding can lead to data corruption [details].
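As a rough illustration of the kind of corruption meant here, the sketch below encodes a string as UTF-8 and then decodes the bytes with the wrong charset, and also round-trips a string through US-ASCII. The class and method names are made up for this example, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Encodes as UTF-8, then decodes the same bytes as ISO-8859-1 (the wrong charset).
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes to US-ASCII and back; characters ASCII cannot represent become '?'.
    static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        // One accented letter turns into two unrelated characters.
        System.out.println(decodeUtf8AsLatin1(original).length());
        // The accented letter is lost and cannot be recovered.
        System.out.println(roundTripThroughAscii(original));
    }
}
```

Pure ASCII text survives both operations unchanged, which is why this class of bug often goes unnoticed until the first accented character arrives.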

2: Prefer data formats that either mandate Unicode or support Unicode and are self describing

Be reluctant to use the operating system default encoding and understand how it can result in data loss [details]. Use unambiguous formats to make data more portable [details].
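One way to avoid depending on the platform default is to name the charset at every byte/character boundary. The following is a minimal sketch using in-memory streams; the class and method names are hypothetical, and it assumes Java 8 or later for `UncheckedIOException`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    // Writes text through a Writer whose charset is stated explicitly.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads text back through a Reader that uses the same explicit charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String price = "\u20ac 10"; // euro sign, space, "10"
        byte[] stored = write(price, StandardCharsets.UTF_8);
        System.out.println(read(stored, StandardCharsets.UTF_8).equals(price));
    }
}
```

The same pattern applies to files and sockets: the single-argument `InputStreamReader` and `OutputStreamWriter` constructors are the ones that silently pick up the default encoding.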

3: Understand what byte order marks are and which encoding schemes use them

Know how to detect and handle BOMs and pick Unicode encodings accordingly [details].
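A simple BOM check might look like the sketch below. The class and method names are invented for illustration, and returning a charset name (or `null`) is just one possible design. Note that the bytes FF FE 00 00 are ambiguous: they could also be a UTF-16LE BOM followed by U+0000, and this sketch assumes UTF-32LE.

```java
final class BomSniffer {

    // Returns the encoding implied by a leading BOM, or null if none is found.
    static String detectBom(byte[] data) {
        // The four-byte marks must be tested before the two-byte ones.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Removes a leading U+FEFF left behind by a decoder that does not consume it.
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(detectBom(utf8WithBom));
    }
}
```

Java's UTF-8 decoder passes the BOM through as the character U+FEFF, so a helper like `stripLeadingBom` is typically needed after decoding such data.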

4: Understand how encoding affects Java source files

Judicious use of Unicode escape sequences can improve file portability [details].
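For instance, a literal written as `"\u00e9"` compiles to the same string regardless of the encoding the compiler assumes for the file, whereas a raw accented character depends on the editor and the compiler's `-encoding` option agreeing. The sketch below uses hypothetical names and simply prints the code units of escaped literals.

```java
final class EscapeDemo {

    // LATIN SMALL LETTER E WITH ACUTE, written as an escape so the file stays pure ASCII.
    static final String E_ACUTE = "\u00e9";

    // EURO SIGN, also written as an escape.
    static final String EURO = "\u20ac";

    // Formats each char of the string as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) out.append(' ');
            out.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(E_ACUTE));
        System.out.println(describe(EURO));
    }
}
```

Escapes are processed before the compiler tokenises the source, including inside comments, so they are best kept to string and character literals.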

5: Do not assume that all characters in an arbitrary encoding are stored in the same number of bytes

Understand how variable width encoding works [details].
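To make the point concrete, this small sketch counts the bytes that a few single code points occupy in UTF-8 and UTF-16BE. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WidthDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041
            "\u00e9",       // U+00E9
            "\u20ac",       // U+20AC
            "\uD834\uDD1E"  // U+1D11E, a supplementary character
        };
        for (String sample : samples) {
            System.out.println(
                "U+" + Integer.toHexString(sample.codePointAt(0)).toUpperCase()
                + " UTF-8: " + encodedLength(sample, StandardCharsets.UTF_8)
                + " UTF-16BE: " + encodedLength(sample, StandardCharsets.UTF_16BE));
        }
    }
}
```

In UTF-8 these four code points take 1, 2, 3 and 4 bytes respectively; in UTF-16BE the first three take 2 bytes and the last takes 4. Any code that computes offsets as "character index times a constant" breaks on such data.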

6: Understand why `"é".equals("é")` can return `false`

Be aware of canonically-equivalent strings and normalisation [details].
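The following sketch shows one way the comparison can fail and how `java.text.Normalizer` (available since Java 6) can be used to compare canonically-equivalent strings. The class name and the `canonicallyEqual` helper are hypothetical, and the choice of NFC over NFD is arbitrary here.

```java
import java.text.Normalizer;

final class NormalizationDemo {

    // Precomposed form: a single code point, U+00E9.
    static final String PRECOMPOSED = "\u00e9";

    // Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
    static final String DECOMPOSED = "e\u0301";

    // Compares two strings after normalising both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }

    public static void main(String[] args) {
        // Both strings render the same, but equals compares char by char.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));
        System.out.println(canonicallyEqual(PRECOMPOSED, DECOMPOSED));
    }
}
```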

7: Understand that the number of "characters" in a string may not equal the number of `char` elements it contains

Be aware of combining sequences [details]. Be aware of the supplementary character range [details]. Know how to calculate the string length appropriate for your algorithm [details].
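The three notions of length can be compared with a sketch like this one. The class and method names are made up for the example, and `BreakIterator` is used only as an approximation of user-perceived characters; its handling of more complex sequences may differ between Java versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class LengthDemo {

    // Number of UTF-16 code units (what String.length reports).
    static int charCount(String text) {
        return text.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    // Approximate number of user-perceived characters, counted by
    // stepping over character boundaries.
    static int graphemeCount(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String combining = "e\u0301";           // e + combining acute accent
        String supplementary = "\uD834\uDD1E";  // U+1D11E as a surrogate pair
        for (String s : new String[] {combining, supplementary}) {
            System.out.println(charCount(s) + " chars, "
                + codePointCount(s) + " code points, "
                + graphemeCount(s) + " perceived characters");
        }
    }
}
```

Which count is right depends on the algorithm: buffer sizing wants `char` counts, code point iteration wants code points, and anything shown to a user (cursor movement, truncation) usually wants the last one.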

8: Use the `String` class encode/decode methods only on whole data

Understand how the misuse of encoding methods can lead to data corruption [details].
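A typical misuse is decoding a byte stream one buffer at a time with `new String(...)`, which breaks any multi-byte sequence that straddles a buffer boundary. The sketch below contrasts that with collecting all the bytes first; the names are hypothetical and the chunk size stands in for whatever a real read loop would return.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ChunkedDecodeDemo {

    // Wrong: each chunk is decoded on its own, so a character split
    // across two chunks is replaced with U+FFFD.
    static String decodeChunkByChunk(byte[] data, int chunkSize) {
        StringBuilder text = new StringBuilder();
        for (int start = 0; start < data.length; start += chunkSize) {
            int length = Math.min(chunkSize, data.length - start);
            text.append(new String(data, start, length, StandardCharsets.UTF_8));
        }
        return text.toString();
    }

    // Safer: accumulate the chunks as bytes and decode once at the end.
    static String decodeWhole(byte[] data, int chunkSize) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int start = 0; start < data.length; start += chunkSize) {
            int length = Math.min(chunkSize, data.length - start);
            buffer.write(data, start, length);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 4 splits the two-byte sequence for the last letter.
        System.out.println(decodeChunkByChunk(utf8, 4).equals(original));
        System.out.println(decodeWhole(utf8, 4).equals(original));
    }
}
```

For data too large to buffer, wrapping the stream in an `InputStreamReader` with an explicit charset is the usual alternative, since the reader carries incomplete sequences over between reads.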

Illegal Argument Exception