This is my attempt at a list of maxims to abide by when working with text in Java, in the vein of Effective Java or The Ten Commandments of Unicode. It is also a summary of another post on character encoding. The list is in no way comprehensive.
1: Understand how UTF-8 differs from ASCII (and CP437, and Windows-1252, and ISO-8859-1, and ISO-8859-15, and MacRoman, etc.)
Understand how using the wrong encoding can lead to data corruption [details].
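As a minimal sketch of that kind of corruption, the following encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1. The class and method names are made up for illustration. The four-character word "café" comes back as five characters, because the two bytes of the "é" are each read as a separate Latin-1 character.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String decodeWithWrongCharset(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // four characters, the last one is e-acute
        String garbled = decodeWithWrongCharset(original);
        // The e-acute occupies two bytes in UTF-8, so it turns into two characters.
        System.out.println(original.length() + " chars in, " + garbled.length() + " chars out");
    }
}
```

Pure ASCII text survives this mistake untouched, which is why the bug often goes unnoticed until the first accented character arrives.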
2: Prefer data formats that either mandate Unicode or support Unicode and are self describing
Be reluctant to use the operating system default encoding and understand how it can result in data loss [details]. Use unambiguous formats to make data more portable [details].
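One way to see the data loss is to round-trip a string through a charset that cannot represent all of its characters. In this sketch ISO-8859-1 stands in for whatever a platform default might happen to be; the helper name `roundTrip` is hypothetical. `String.getBytes(Charset)` silently substitutes a replacement byte for anything it cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encodes and then decodes the text with the same, explicitly named charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String price = "\u20ac5"; // euro sign followed by the digit 5

        // UTF-8 can encode every Unicode character, so nothing is lost.
        System.out.println(price.equals(roundTrip(price, StandardCharsets.UTF_8)));

        // ISO-8859-1 has no euro sign; it is silently replaced with '?'.
        System.out.println(price.equals(roundTrip(price, StandardCharsets.ISO_8859_1)));
    }
}
```

Passing the charset explicitly, as both calls do here, at least makes the choice visible in the code rather than leaving it to the machine the program happens to run on.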
3: Understand what byte order marks are and which encoding schemes use them
Know how to detect and handle BOMs and pick Unicode encodings accordingly [details].
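A rough sketch of BOM detection follows, assuming the first few bytes of the data are available as an array. The class name and the choice to return a charset name (or `null` when no BOM is found) are illustrative only. The UTF-32 checks come before the UTF-16 ones because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```java
class BomSniffer {
    // Returns the encoding implied by a leading byte order mark, or null if none is present.
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        // Ambiguous in theory: this could also be a UTF-16LE BOM followed by U+0000.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the leading bytes as unsigned values against the expected prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));
        System.out.println(detect(new byte[] {'h', 'i'}));
    }
}
```

The absence of a BOM says nothing about the encoding, so a `null` result still needs a fallback decided by the data format or by the caller.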
4: Understand how encoding affects Java source files
Judicious use of Unicode escape sequences can improve file portability [details].
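For illustration, the constants below are written entirely with Unicode escape sequences, so the source file contains only ASCII and compiles to the same strings whatever encoding the compiler assumes. The class and constant names are invented for this sketch.

```java
class EscapeDemo {
    // Written with escapes, so the source file itself stays pure ASCII.
    static final String E_ACUTE = "\u00e9";      // LATIN SMALL LETTER E WITH ACUTE
    static final String EURO = "\u20ac";         // EURO SIGN
    static final String G_CLEF = "\uD834\uDD1E"; // U+1D11E, written as a surrogate pair

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(E_ACUTE.charAt(0)));
        System.out.println(Integer.toHexString(G_CLEF.codePointAt(0)));
    }
}
```

The alternative is to keep literal characters in the source and tell the compiler what to expect, for example with `javac -encoding UTF-8`, at the cost of depending on every tool in the chain agreeing on that encoding.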
5: Do not assume that all characters in an arbitrary encoding are stored in the same number of bytes
Understand how variable width encoding works [details].
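A small sketch makes the variable widths concrete by counting the bytes each of four sample characters needs in UTF-8 and in UTF-16. The helper name `encodedLength` is made up for this example, and UTF-16BE is used so that no byte order mark inflates the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WidthDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041
            "\u00e9",       // U+00E9, e-acute
            "\u20ac",       // U+20AC, euro sign
            "\uD834\uDD1E"  // U+1D11E, musical G clef (supplementary character)
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + ": UTF-8 = " + encodedLength(s, StandardCharsets.UTF_8)
                    + " bytes, UTF-16 = " + encodedLength(s, StandardCharsets.UTF_16BE) + " bytes");
        }
    }
}
```

In UTF-8 the four samples take 1, 2, 3 and 4 bytes respectively; in UTF-16 the first three take 2 bytes each and the last takes 4.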
6: Understand why "é".equals("é") can return false
Be aware of canonically-equivalent strings and normalisation [details].
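The sketch below shows one way this can happen and one way to compare such strings, using `java.text.Normalizer`. The precomposed form is a single code point, while the decomposed form is a plain "e" followed by a combining acute accent. The method name `canonicallyEqual` is hypothetical, and NFC is just one reasonable choice of normalisation form.

```java
import java.text.Normalizer;

class NormalisationDemo {
    // Compares two strings after converting both to the same normalisation form (NFC).
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00e9";    // e-acute as one code point
        String decomposed = "e\u0301"; // 'e' plus COMBINING ACUTE ACCENT

        System.out.println(composed.equals(decomposed));           // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true
    }
}
```

Both strings render identically, which is what makes the failing `equals` call so surprising when it turns up in real data.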
7: Understand that the number of "characters" in a string may not equal the number of char elements it contains
Be aware of combining sequences [details]. Be aware of the supplementary character range [details]. Know how to calculate the string length appropriate for your algorithm [details].
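As a sketch of the different answers to "how long is this string?", the following compares the `char` count, the code point count, and an approximation of the user-perceived character count obtained from `java.text.BreakIterator`. The class and method names are invented for illustration, and the exact boundaries `BreakIterator` reports can vary between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

class LengthDemo {
    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters, as seen by BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // one supplementary character, two char elements
        String accented = "e\u0301";  // one visible character, two code points

        System.out.println(clef.length() + " chars, " + codePoints(clef) + " code point(s)");
        System.out.println(accented.length() + " chars, " + codePoints(accented)
                + " code points, " + graphemes(accented) + " grapheme(s)");
    }
}
```

Which count is right depends on the job: `length()` for indexing into the `char` array, code points for most Unicode-aware processing, and grapheme boundaries for anything a user sees, such as cursor movement or truncating display text.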
8: Use the String class encode/decode methods only on whole data
Understand how the misuse of encoding methods can lead to data corruption [details].
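To illustrate, suppose UTF-8 data arrives in chunks and a chunk boundary falls in the middle of a multi-byte character. In this sketch (names are hypothetical) decoding each chunk on its own produces replacement characters, while collecting the bytes first and decoding once gives the expected text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedDecodeDemo {
    // Wrong: each chunk is decoded in isolation, so split characters are corrupted.
    static String decodePerChunk(byte[][] chunks) {
        StringBuilder text = new StringBuilder();
        for (byte[] chunk : chunks) {
            text.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return text.toString();
    }

    // Better: gather all the bytes, then decode the whole sequence once.
    static String decodeWhole(byte[][] chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            buffer.write(chunk, 0, chunk.length);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" followed by e-acute (0xC3 0xA9), split between the two bytes of the e-acute.
        byte[][] chunks = { {'a', (byte) 0xC3}, {(byte) 0xA9} };

        System.out.println(decodeWhole(chunks).equals("a\u00e9"));    // true
        System.out.println(decodePerChunk(chunks).equals("a\u00e9")); // false
    }
}
```

When the data is too large to buffer, a streaming decoder such as `java.io.InputStreamReader` or `java.nio.charset.CharsetDecoder` keeps the partial character between reads and avoids the same problem.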