This is my attempt at a list of maxims to abide by when working with text in Java, in the vein of Effective Java or The Ten Commandments of Unicode. It is also a summary of another post on character encoding. The list is in no way comprehensive.
1: Understand how UTF-8 differs from ASCII (and CP437, and Windows-1252, and ISO-8859-1, and ISO-8859-15, and MacRoman, etc.)
Understand how using the wrong encoding can lead to data corruption [details].
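As a minimal sketch of that corruption, the snippet below encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1. The class and method names (`MojibakeDemo`, `decodeWithWrongCharset`) are made up for illustration; only the standard `String` and `StandardCharsets` APIs are assumed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class MojibakeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    static String decodeWithWrongCharset(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        String garbled = decodeWithWrongCharset(original);
        // The single accented letter became two unrelated characters.
        System.out.println(original.length() + " -> " + garbled.length()); // prints "4 -> 5"
    }
}
```

The two bytes that UTF-8 uses for U+00E9 (0xC3 0xA9) are each a complete character in ISO-8859-1, so nothing fails loudly; the text is just silently wrong.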
2: Prefer data formats that either mandate Unicode or support Unicode and are self describing
3: Understand what byte order marks are and which encoding schemes use them
Know how to detect and handle BOMs and pick Unicode encodings accordingly [details].
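One way to handle BOMs is to sniff the leading bytes before choosing a decoder. The sketch below is an assumption-laden illustration, not a complete detector: `BomSniffer` and its methods are hypothetical names, it recognises only the UTF-8 and UTF-16 marks, and it deliberately ignores UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; covers UTF-8 and UTF-16 BOMs only (UTF-32 is ignored).
class BomSniffer {

    // True if the data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Decodes the data, skipping a recognised BOM; uses the fallback when none is found.
    static String decode(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        return new String(data, fallback);
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69 };
        System.out.println(decode(withBom, StandardCharsets.ISO_8859_1)); // prints "hi"
    }
}
```

Skipping the mark explicitly matters for UTF-8 in particular, because Java's UTF-8 decoder leaves a leading U+FEFF in the decoded string.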
4: Understand how encoding affects Java source files
Judicious use of Unicode escape sequences can improve file portability [details].
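For illustration, the sketch below keeps the source file pure ASCII by writing non-ASCII characters as Unicode escapes, so the compiled result should not depend on which encoding the compiler assumes for the file. `EscapeDemo` and its constants are hypothetical names.

```java
// Hypothetical demo class; the source contains ASCII characters only.
class EscapeDemo {

    // Escape for U+00E9 (e with acute accent).
    static final String CAFE = "caf\u00e9";

    // Escape for U+20AC (euro sign).
    static final String EURO = "\u20ac";

    public static void main(String[] args) {
        System.out.println(CAFE.length());                          // prints "4"
        System.out.println(Integer.toHexString(EURO.charAt(0)));    // prints "20ac"
    }
}
```

The trade-off is readability: escapes are opaque to a human reader, so a short comment naming the character helps. Note that the compiler translates escapes everywhere in the file, including inside comments.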
5: Do not assume that all characters in an arbitrary encoding are stored in the same number of bytes
Understand how variable width encoding works [details].
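A quick way to see variable-width encoding is to count the bytes that different characters occupy. In this sketch, `ByteWidths.encodedLength` is a hypothetical helper; UTF-16BE is used instead of UTF-16 so that no byte order mark is added to the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for counting encoded bytes.
class ByteWidths {

    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041
            "\u00e9",       // U+00E9
            "\u20ac",       // U+20AC
            "\ud83d\ude00"  // U+1F600, a surrogate pair in Java
        };
        for (String s : samples) {
            System.out.println(
                "U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                + " UTF-8: " + encodedLength(s, StandardCharsets.UTF_8)
                + " UTF-16: " + encodedLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, and 2, 2, 2 and 4 bytes in UTF-16, so neither encoding is fixed width.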
6: Understand why two strings that look identical may not be equal
Be aware of canonically-equivalent strings and normalisation [details].
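As a small sketch of canonical equivalence, the strings below both render as an accented "e" but contain different code point sequences, so `equals` reports them as different until both are normalised to the same form. `NormalizationDemo.canonicallyEqual` is a hypothetical helper built on the standard `java.text.Normalizer`; choosing NFC here is an assumption, and NFD would work equally well for this comparison.

```java
import java.text.Normalizer;

// Hypothetical helper; compares strings after normalising both to NFC.
class NormalizationDemo {

    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // single code point: e with acute
        String decomposed = "e\u0301";  // e followed by a combining acute accent
        System.out.println(precomposed.equals(decomposed));           // prints "false"
        System.out.println(canonicallyEqual(precomposed, decomposed)); // prints "true"
    }
}
```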
7: Understand that the number of "characters" in a string may not equal the number of char elements it contains
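The difference shows up as soon as a string contains a character outside the Basic Multilingual Plane, which Java stores as a surrogate pair of two char values. The sketch below uses a hypothetical `CodePointCount` class around the standard `String.codePointCount` method. It counts code points only; a user-perceived character can still span several code points (for example a letter plus a combining accent).

```java
// Hypothetical demo class; the names are illustrative only.
class CodePointCount {

    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String grin = "\ud83d\ude00"; // U+1F600 encoded as a surrogate pair
        System.out.println(grin.length());    // prints "2"
        System.out.println(codePoints(grin)); // prints "1"
    }
}
```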
8: Use the String class encode/decode methods only on whole data
Understand how the misuse of encoding methods can lead to data corruption [details].
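A typical misuse is decoding each buffer of a stream separately, which breaks whenever a multi-byte sequence straddles two buffers. The sketch below contrasts that with decoding the complete data once. `ChunkedDecoding` and its methods are hypothetical names, and buffering everything in memory is just the simplest fix to show; for real streams a `Reader` such as `InputStreamReader` is usually the better fit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class ChunkedDecoding {

    // Misuse: each chunk is decoded in isolation, so split sequences are corrupted.
    static String decodePerChunk(byte[][] chunks) {
        StringBuilder result = new StringBuilder();
        for (byte[] chunk : chunks) {
            result.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return result.toString();
    }

    // Safer: gather all the bytes first, then decode the whole data once.
    static String decodeWhole(byte[][] chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            buffer.write(chunk, 0, chunk.length);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 form of U+00E9 (0xC3 0xA9), split across two chunks.
        byte[][] chunks = { { (byte) 0xC3 }, { (byte) 0xA9 } };
        System.out.println(decodePerChunk(chunks).equals("\u00e9")); // prints "false"
        System.out.println(decodeWhole(chunks).equals("\u00e9"));    // prints "true"
    }
}
```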