It can be tricky to tell the difference between character handling code that works and code that merely appears to work because testing never hit the cases that expose its bugs. This is a post about some of the pitfalls of character handling in Java.
Topics:
- Unicode in source files
- Unicode and Java data types
- How long is a (piece of) String?
- Encodings
- Stream encoding
- Lossy conversions
- Unicode byte order marks
- Automatic detection of encoding
- The potential pitfalls of encoding and decoding using the String class
- Notes
I wrote a little bit about Unicode before.
This post might be exhausting, but it isn't exhaustive.
Unicode in source files
Java source files include support for Unicode. There are two common mechanisms for writing code that includes a range of Unicode characters.
One choice is to encode the source files as Unicode, write the characters directly, and tell the compiler about it at compile time. javac provides the -encoding <encoding> option for this.
The downside is that anyone who uses the source file needs to be aware of this. Otherwise, attempting to compile the file on a system that uses a different default encoding will produce unpredictable results. Worse, it might appear to work.
Code saved as UTF-8, as might be written on an Ubuntu machine:
```java
public class PrintCopyright {
    public static void main(String[] args) {
        System.out.println("© Acme, Inc.");
    }
}
```
1. Compiling the code using the correct encoding:
javac -encoding UTF-8 PrintCopyright.java
2. Simulating a straight compile on Western-locale Windows:
javac -encoding Cp1252 PrintCopyright.java
These compiler settings will produce different outputs; only the first one is correct.
Note: the JDK 1.6 javac compiler will not compile a UTF-8 source file starting with a byte order mark, failing with the error illegal character: \65279. Byte order marks are discussed further down the page.
The other way to handle Unicode in source files is to use only characters that are encoded with the same values in many character encodings, and to replace everything else with Unicode escape sequences. In the amended source, the copyright sign © is replaced with the escape sequence \u00A9:
```java
public class PrintCopyright {
    public static void main(String[] args) {
        System.out.println("\u00A9 Acme, Inc.");
    }
}
```
The amended source, saved as UTF-8, will produce the same output whether processed as UTF-8 or Cp1252. The characters used have the same binary values in both encodings.
These escape sequences apply to the whole file, not just String/char literals. This can produce some funny edge cases. Try compiling this code:
```java
public class WillNotCompile {
    // the path c:\udir is interpreted as an escape sequence
    // the escape \u000A is replaced with a new line
}
```
Unicode support is detailed in the Lexical Structure chapter of the Java Language Specification.
Unicode and Java data types
Before tackling the encoding API, it is a good idea to get a handle on how text is represented in Java strings.
| Grapheme | Unicode Character Name(s) | Unicode Code Point(s) | Java char Literals | UTF-16BE (encoded bytes) | UTF-8 (encoded bytes) |
|---|---|---|---|---|---|
| A | LATIN_CAPITAL_LETTER_A | U+0041 | '\u0041' | 0041 | 41 |
| © | COPYRIGHT_SIGN | U+00A9 | '\u00A9' | 00A9 | C2A9 |
| é | LATIN_SMALL_LETTER_E_WITH_ACUTE | U+00E9 | '\u00E9' | 00E9 | C3A9 |
| é | LATIN_SMALL_LETTER_E COMBINING_ACUTE_ACCENT | U+0065 U+0301 | '\u0065' '\u0301' | 0065 0301 | 65 CC81 |
| क्तु | DEVANAGARI_LETTER_KA DEVANAGARI_SIGN_VIRAMA DEVANAGARI_LETTER_TA DEVANAGARI_VOWEL_SIGN_U | U+0915 U+094D U+0924 U+0941 | '\u0915' '\u094D' '\u0924' '\u0941' | 0915 094D 0924 0941 | E0A495 E0A58D E0A4A4 E0A581 |
| 𝔊 | MATHEMATICAL_FRAKTUR_CAPITAL_G | U+1D50A | '\uD835' '\uDD0A' | D835DD0A | F09D948A |

All values are hexadecimal. Some of the graphemes might not render (though they all seem to work on Firefox).

Table 1
The table above shows some of the things we have to look out for.
1. Stored characters can take up an inconsistent number of bytes. A UTF-8 encoded character might take between one (LATIN_CAPITAL_LETTER_A) and four (MATHEMATICAL_FRAKTUR_CAPITAL_G) bytes. Variable width encoding has implications for reading into and decoding from byte arrays.
2. Characters can be represented in multiple forms. As can be seen with e-acute (é), sometimes there is more than one way to store and render a grapheme. Sometimes graphemes are formed using combining sequences (as in the e-acute example); sometimes there are distinct but similar characters (Greek small letter mu μ versus the micro sign µ). This may have implications for sorting, expression matching and capitalisation. It might raise compatibility issues when translating data between encodings. You can normalise strings using the Normalizer class, but be aware of any gotchas.
3. Not all code points can be stored in a char. The MATHEMATICAL_FRAKTUR_CAPITAL_G example lies in the supplementary range of characters and cannot be stored in 16 bits. It must be represented by two sequential char values, neither of which is meaningful by itself. The Character class provides methods for working with 32-bit code points, such as converting a code point to a char array.
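For example, a minimal sketch (class and variable names are illustrative) that converts a supplementary code point to a char array and back:

```java
public class CodePoints {
    public static void main(String[] args) {
        // Unicode code point to char array
        char[] units = Character.toChars(0x1D50A); // MATHEMATICAL FRAKTUR CAPITAL G
        String str = new String(units);
        System.out.println(str.length());                            // 2 (char count)
        System.out.println(str.codePointCount(0, str.length()));     // 1 (code point count)
        System.out.println(Integer.toHexString(str.codePointAt(0))); // 1d50a
    }
}
```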
4. The relationship between the grapheme visible to the user and the underlying code points may not be 1:1. This can be seen in the combining character sequences (the e-acute example). As the Devanagari example shows, combining sequences can get quite complex.
How long is a (piece of) String?
So, if a "character" can span multiple char
values,
we need to know how that affects our ability to calculate string
lengths.
- String.length() returns the number of chars in the String.
- String.codePointCount(int, int) returns the number of Unicode code points in the String.
- BreakIterator.getCharacterInstance() can be used to count the number of graphemes in a String.
The char count is likely to be most useful for low-level data operations; the grapheme count is the one you would use to do your own line wrapping in a GUI. You can find out if a particular font can render your string using the Font class.
Code to get the various lengths:
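A minimal sketch (the sample string and output formatting are illustrative) that measures a string by chars, code points and graphemes:

```java
import java.text.BreakIterator;

public class StringLengths {
    public static void main(String[] args) {
        // A, é, e + COMBINING ACUTE ACCENT, MATHEMATICAL FRAKTUR CAPITAL G
        String s = "A\u00E9e\u0301\uD835\uDD0A";
        System.out.println("char count:       " + s.length());                      // 6
        System.out.println("code point count: " + s.codePointCount(0, s.length())); // 5
        System.out.println("grapheme count:   " + countGraphemes(s));               // 4
    }

    private static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}
```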
Sample strings and their lengths by the various criteria:
| Unicode code points | graphemes | char count | code point count | grapheme count |
|---|---|---|---|---|
| U+0041 | A | 1 | 1 | 1 |
| U+00E9 | é | 1 | 1 | 1 |
| U+0065 U+0301 | é | 2 | 2 | 1 |
| U+0915 U+094D U+0924 U+0941 | क्तु | 4 | 4 | 1 |
| U+1D50A | 𝔊 | 2 | 1 | 1 |
| all of the above | Aééक्तु𝔊 | 10 | 9 | 5 |

Table 2
Encodings
Character encodings map characters to byte representations. The Unicode character set is mapped to bytes using Unicode transformation formats (UTF-8, UTF-16, UTF-32, etc.). Most encodings can represent only a subset of the characters supported by Unicode. Java strings use UTF-16.
Code that prints the byte representation of the pound sign (£) in different encodings:
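Something like the following minimal sketch (not necessarily the original listing) produces the output shown below:

```java
import java.io.UnsupportedEncodingException;

public class PoundSignBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String[] encodings = { "Cp1252", // Windows-1252
                "UTF8", "UTF-16BE" };
        String pound = "\u00A3"; // £ POUND SIGN
        for (String encoding : encodings) {
            System.out.print(encoding + " " + pound);
            for (byte b : pound.getBytes(encoding)) {
                System.out.printf(" %02x", b & 0xff);
            }
            System.out.println();
        }
    }
}
```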
Output of the code:
```
Cp1252 £ a3
UTF8 £ c2 a3
UTF-16BE £ 00 a3
```
Java uses two mechanisms to represent supported encodings. The initial mechanism was via string IDs. Java 1.4 introduced the type-safe Charset class. Note that the two mechanisms use different canonical names to represent the same encodings.
Java 6 implementations are only required to support six encodings (US-ASCII; ISO-8859-1; UTF-8; UTF-16BE; UTF-16LE; UTF-16). In practice, they tend to support many more:
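One way to see what a given JRE supports is to list the available charsets (a minimal sketch):

```java
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // print the canonical name of every charset this JRE supports
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```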
Other JREs (e.g. IBM; Oracle JRockit) may support different sets of encodings.
Every platform has a default encoding:
```java
Charset charset = Charset.defaultCharset();
```
If a standard Java API converts between byte and char data, there is a high probability that it is using the default encoding. Examples of such calls are:
- String(byte[])
- String.getBytes()
- InputStreamReader(InputStream)
- OutputStreamWriter(OutputStream)
- anything in FileReader or FileWriter
The problem with using these calls is that you cannot predict whether data written on one machine will be read correctly on another, even if you use the same application. An English Windows PC will use Windows-1252; a Russian Windows PC will use Windows-1251; an Ubuntu machine will use UTF-8. Prefer encoding methods like OutputStreamWriter(OutputStream, Charset) that let you set the encoding explicitly. Fall back on the default encoding only when you have no choice.
Portability is not the only problem with relying on the default encoding. As will be shown, using the old 8-bit encodings that many systems use for compatibility reasons can also result in data loss.
Encoding operations transform data. Do not try to pass "binary" data through character methods. A Java char is not a C char.
UPDATE (Dec 2009): A quick note on the file.encoding system property, which you may see mentioned in Java encoding discussions. Do not set or rely on this property. Sun says:
The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
Stream encoding
This code reliably writes and then reads character data using a file:
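A minimal sketch of such a pair of methods (readFromFile and the exact signatures are assumptions); both use an explicit UTF-8 Charset, so the write and read are symmetrical:

```java
import java.io.*;
import java.nio.charset.Charset;

public class FileRoundTrip {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    public static void writeToFile(File file, String text) throws IOException {
        OutputStream out = new FileOutputStream(file);
        try {
            Writer writer = new OutputStreamWriter(out, UTF8);
            writer.write(text);
            writer.flush();
        } finally {
            out.close();
        }
    }

    public static String readFromFile(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            Reader reader = new InputStreamReader(in, UTF8);
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } finally {
            in.close();
        }
    }
}
```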
The same encoding is used for both operations, so they are symmetrical. A Unicode encoding is used, so no data loss will occur.
One potential problem area is wrapping a stream with a writer when data has already been written to the stream (or when appending data to a file). Be careful of byte order marks. This is the same problem as the one discussed below using string encoding calls.
Lossy conversions
Saving to the operating system's default encoding can be a lossy process. This code simulates what happens when the mathematical expression x ≠ y is saved and loaded on an English Windows system using its default file encoding:
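A minimal sketch of that simulation (the class name is illustrative), encoding and decoding with Cp1252:

```java
import java.io.UnsupportedEncodingException;

public class LossySave {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // \u2260 NOT EQUAL TO
        String expression = "x \u2260 y";
        // simulate saving and loading with the default English Windows encoding
        byte[] saved = expression.getBytes("Cp1252");
        String loaded = new String(saved, "Cp1252");
        System.out.println(loaded); // prints "x ? y"
    }
}
```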
The symbol NOT_EQUAL_TO (≠) is replaced with a question mark during encoding. The Windows-1252 character set does not include that character.
Use non-Unicode encodings reluctantly. Prefer self-describing file formats that support Unicode (like XML [spec]) or formats that mandate Unicode (like JSON [spec]).
Note: normalisation and encoding are distinct operations; the combining sequence U+0065 U+0301 (é) will (using windows-1252) be encoded to the bytes 65 3F (e?). The API will not normalise text automatically.
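A short illustrative sketch of the difference normalising to NFC makes before encoding:

```java
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

public class NormaliseThenEncode {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String decomposed = "e\u0301"; // U+0065 U+0301 (combining sequence)
        printBytes(decomposed.getBytes("windows-1252")); // 65 3f ("e?")

        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        printBytes(composed.getBytes("windows-1252"));   // e9 ("é")
    }

    private static void printBytes(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```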
Unicode byte order marks
A byte order mark (BOM) is a sequence of bytes that is sometimes used to mark the start of the data with information about the encoding in use. Sometimes Unicode BOMs are mandatory; sometimes they must not be used. The Unicode BOM FAQ contains more details, including a section on how to deal with them.
This code generates the byte order marks for selected Unicode encodings:
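A minimal sketch of one way to do it: encode the character U+FEFF (ZERO WIDTH NO-BREAK SPACE) with each encoding (this assumes the JRE supports the UTF-32 encodings):

```java
import java.io.UnsupportedEncodingException;

public class PrintByteOrderMarks {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode this to get byte order mark
        String bom = "\uFEFF";
        String[] encodings = { "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE" };
        for (String encoding : encodings) {
            System.out.print(encoding + " BOM:");
            for (byte b : bom.getBytes(encoding)) {
                System.out.printf(" %02x", b & 0xff);
            }
            System.out.println();
        }
    }
}
```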
The generated output:
```
UTF-8 BOM: ef bb bf
UTF-16BE BOM: fe ff
UTF-16LE BOM: ff fe
UTF-32BE BOM: 00 00 fe ff
UTF-32LE BOM: ff fe 00 00
```
Some encodings will automatically emit byte order marks on encode and read them on decode. Select the encoding scheme appropriate for your use case.
This code inserts a BOM automatically:
The output, prefixed with the fe ff big-endian BOM:
UTF-16 encoded: fe ff 00 68 00 65 00 6c 00 6c 00 6f
Automatic detection of encoding
Automatic detection of encoding has been tried (notably in Windows), but there is no way to get it to work with one hundred percent reliability for an arbitrary text file. Notepad detects the windows-1252 encoded string "this app can break" as being CJK characters.
The International Components for Unicode project includes support for character set detection:
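A minimal sketch using ICU4J's CharsetDetector (the output formatting is an assumption):

```java
import java.io.UnsupportedEncodingException;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] thisAppCanBreak = "this app can break".getBytes("windows-1252");
        CharsetDetector detector = new CharsetDetector();
        detector.setText(thisAppCanBreak);
        System.out.println("CONFIDENCE CHARSET LANGUAGE");
        for (CharsetMatch match : detector.detectAll()) {
            System.out.println(match.getConfidence() + " "
                    + match.getName() + " " + match.getLanguage());
        }
    }
}
```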
Output:
```
CONFIDENCE CHARSET    LANGUAGE
63         ISO-8859-2 hu
47         ISO-8859-9 tr
47         ISO-8859-1 nl
47         ISO-8859-1 en
31         ISO-8859-2 ro
31         ISO-8859-1 fr
15         ISO-8859-1 sv
15         ISO-8859-1 pt
15         ISO-8859-1 es
15         ISO-8859-1 da
10         Big5       zh
10         EUC-KR     ko
10         EUC-JP     ja
10         GB18030    zh
10         Shift_JIS  ja
10         UTF-8      null
```
The chosen text will decode correctly if interpreted as ISO-8859-2 Hungarian, so it looks like we will have to try harder in the selection of data that will cause character corruption. As the ICU project notes about detection:
This is, at best, an imprecise operation using statistics and heuristics.
The potential pitfalls of encoding and decoding using the String class
The java.lang.String class can be used to encode and decode data, though be aware of problem areas.
This code demonstrates how attempting to use string methods to decode multibyte-encoded text from a stream can lead to character corruption:
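A minimal sketch of such a buggy method (the buffer size and sample data are chosen here to expose the bug):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BuggyDecode {
    /** buggy decode method - do not use! */
    public static String decode(InputStream in, String encoding) throws IOException {
        StringBuilder text = new StringBuilder();
        byte[] buffer = new byte[4]; // deliberately small to expose the bug
        int read;
        while ((read = in.read(buffer)) != -1) {
            // BUG: a multibyte character can be split across two reads,
            // so each half is decoded separately and becomes garbage
            text.append(new String(buffer, 0, read, encoding));
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc\u00A3def".getBytes("UTF-8"); // the £ straddles the buffer boundary
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(decode(in, "UTF-8")); // prints corrupted text
    }
}
```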
The flaw in this code is that characters in the stream can span reads into the byte buffer. The character U+00A3 (£) becomes corrupted as one half of it ends up in the tail end of the buffer in one pass and the other half ends up at the start of the buffer in the next pass. Neither value is meaningful by itself. Because a UTF-8 encoded character may be between one and four bytes in length, there is no buffer size that will always read whole characters. The entire data would need to be read into a larger byte buffer and decoded in one go. It is safer to use an InputStreamReader.
This code demonstrates how data corruption can occur when encoding text incrementally:
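A minimal sketch of such a buggy method (the method signature and sample data are assumptions):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BuggyEncode {
    /** buggy encoding method - do not use! */
    public static void encode(String[] lines, OutputStream out, String encoding)
            throws IOException {
        for (String line : lines) {
            // BUG: with the UTF-16 encoding scheme, every call to getBytes
            // emits another byte order mark into the middle of the data
            out.write(line.getBytes(encoding));
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(new String[] { "hello ", "world" }, out, "UTF-16");
        String decoded = new String(out.toByteArray(), "UTF-16");
        // the embedded BOM decodes as an unwanted U+FEFF character
        System.out.println(decoded.length()); // 12, not the expected 11
    }
}
```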
The problem here is that the UTF-16 encoding scheme used adds a byte order mark to encoded data. Every time a string is written, another BOM is added, littering the content with unwanted data. When the data is decoded, extra characters end up in the text. All text would need to be concatenated and encoded in one go. It is better to use an OutputStreamWriter.
There are encodings for which these operations are safe, but you would be tying your method implementations to specific encodings rather than making them general. Use string class encode/decode methods only on whole data.
Notes
System.out may not always print the results you expect. I posted about this in terms of the Windows console, but Windows is not the only system with this sort of behaviour.
If you feel you are on rocky ground with character encodings, Bob Balaban has a good presentation: Multi-Language Character Sets: What They Are, How to Use Them (2001; PDF) [alt link]. He worked on character issues for more than a little while. Chances are that you are not a Lotus Notes developer, so you can skip some of the specifics.
The ICU4J library is a good place to turn if you're trying to find Unicode (and other I18N) features not included in the Java standard library.
Examples were developed using Java 6.
Hi. Nice article. I've had trouble with, however, the process of reading (first) and writing to file on linux.
I have no troubles doing this on Windows. And don't understand why.
The file I read, according to firefox, is of character encoding Windows-1250. Which as we know is the default on windows. Whereas UTF-8 is on linux. Which is the main difference, I would think.
But I read in using the same encoding as I write out, like your example. It works on Windows, and it doesn't on Linux. That is, my umlauts in the text get turned into garbage.
Any idea what I need to look at to make this work on linux?
Hi Sean, I can't tell what the problem is based on your description. Diagnosing character encoding issues can be tricky. I would post the problem on a Q&A site like stackoverflow.com. Pare your code down to the minimum required to reproduce the problem and post the hexadecimal form of the text files your application reads/produces (you can use xxd on Linux to produce the hex dump).
Would you, by any chance, have an idea on how to make an approximate conversion between encodings?
ReplyDeleteLike, avoiding '?' when a character is unknown.
For example, the ellipsis character (…) in Unicode can be translated as 3 '.' characters in ASCII.
Thanks in advance
Hi Simon. The Unicode FAQ on Conversions/Mappings doesn't offer much in the way of general advice that will help you.
Assuming the ellipsis character is U+2026 (…) then using java.text.Normalizer.Form.NFKD will turn it into three periods (U+002E). However, be aware that normalizing a string this way may transform other characters in ways you don't intend.
Indeed a great post. Character encoding is very important when converting a byte array to a String. I have also blogged my experience as How to find default character encoding or charset in java to point out issues with file.encoding. Let me know how you find it.
Excellent! Could you please expand in the section [How long is a (piece of) String?](http://illegalargumentexception.blogspot.gr/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_strnlen) on why, for instance, U+1D50A is 2 chars and one code point (and similarly for the grapheme that's 4 code points). Also in the same section, instead of a switch you could use an enum method - return enumInstance.getLength(str) - the switch just blurs the picture I think.
ReplyDeleteCode points are unique writing system constructs identified and supported by the Unicode consortium. Consider accented characters like é. In European, Latin-based languages the number of accents (diacritics) available on a letter is usually one at a time. Other writing systems are more complex. So Unicode must allow the building of visible graphemes using more than one diacritic - (e.g. the letter, followed by the accent). There are too many combinations to represent every form. This leads to the situation where there is more than one way to describe é in code points.
Code points are an abstraction: define a writing system element; give it a number.
Everything must be represented in memory somehow in a computer system. The schemes that do this allocate units of memory (8, 16 or 32 bits), called code units, to represent code points, and are called Unicode Transformation Formats. In UTF-8, a code point is 1, 2, 3 or 4 code units. In UTF-16, a code point is 1 or 2 code units. In UTF-32, a code point is always exactly one code unit.
Java strings are UTF-16 so there can be a difference between the number of code units and the number of code points.
I encourage reading the latest Unicode specification for more information.
I will leave the code as it is - I hope it describes how to measure the differences clearly enough.