Don't assume that the character handling conventions you've learnt in one language/platform will automatically apply in others. I've selected a cross-section of popular languages to contrast the different ways character encoding is handled.
Character data types
Character data is always going to be in some form of encoding, whether on disk or in RAM. Here's a tabular comparison of some character types in different languages.
| Language | Type | Width (bits) | Implicit Encoding |
|---|---|---|---|
| C | `char` | 8 | implementation specific |
| C | `wchar_t` | implementation specific (8+) | implementation specific |
| C# | `char` | 16 | UTF-16 |
| C# | `string` | 16 (`char` sequence) | UTF-16 |
| Java | `char` | 16 | UTF-16 |
| Java | `String` | 16 (`char` sequence) | UTF-16 |
| Python | `str` | 8 (octet sequence) | ASCII (can be changed) |
| Python | `unicode` | 16 or 32 (code unit sequence) | UCS2 (16) or UCS4 (32) |
| Ruby | `String` | 8 (octet sequence) | none |
Transcoding
Transcoding refers to the act of transforming data from one encoding to another. For example, the code point U+20AC (€) might be transcoded from the byte sequence `A4` (ISO-8859-15) to `80` (Windows-1252) or `E2 82 AC` (UTF-8) or `20 AC` (UTF-16BE), or vice-versa.
In order to compare or manipulate strings, it is necessary to transform character data to a common format. This is usually done by defining a common encoding and the means to encode/decode from/to that encoding to/from other encodings.
Underlying encoding issues are identical for all languages; how they are handled in language features and library implementations varies.
All the sample transcoding applications that follow print the hexadecimal byte representation of the euro sign (€) encoded as UTF-8:

```
e2 82 ac
```
C
It is tricky to write a truly general description of character handling in C because as much as possible is left to the implementation and the developer. The standard library confines itself to the execution character set. The encoding used by `char` data is that of the environment under which the program executes, but the type also doubles as a type for working with any byte (octet) data. This is likely one of the sources of the confusion that causes developers to assume that one character equals one byte.
C99 defines `char` to have a minimum range of either -127 to 127 or 0 to 255. That is, it is at least 8 bits wide (`CHAR_BIT >= 8`), not necessarily exactly 8.
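The actual width and range are exposed through `<limits.h>`; a minimal sketch to check what your implementation uses:

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    // CHAR_BIT is at least 8; CHAR_MIN/CHAR_MAX reveal whether plain
    // char is signed or unsigned on this implementation.
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}
```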
Multibyte string literals of type `char` are defined with quotes (`"foo"`) whereas wide string literals of type `wchar_t` are defined with an `L` prefix (`L"bar"`). Both literal types support hex escape sequences (`"\xFF"`) and Unicode escape sequences (`"\uFFFF"` and `"\UFFFFFFFF"`), though how the Unicode escapes are encoded in the resultant byte sequences is implementation dependent.
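To see what your compiler actually emits for a Unicode escape, a minimal sketch (this assumes the implementation can map U+20AC into its execution character set; GCC, for example, defaults to UTF-8):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    // The bytes produced for this escape are implementation dependent;
    // with a UTF-8 execution character set this prints e2, 82, ac.
    const char s[] = "\u20AC"; // euro sign
    for (size_t i = 0; i < strlen(s); i++)
        printf("%02x\n", (unsigned char) s[i]);
    return 0;
}
```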
The wide character type `wchar_t` is typically 16 or 32 bits wide and is usually used to hold Unicode characters (though it doesn't have to be). From C99:

> The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the wide character corresponding to that multibyte character, as defined by the `mbtowc` function, with an implementation-defined current locale.
The intent of `wchar_t` is to be able to represent a multibyte character as a single value. For example, in the Windows-949 encoding, the Hangul character U+D16E (텮) is represented with the byte sequence `B7 41`, which requires a width of two `char`s; conversion to `wchar_t` would generally allow it to be represented with one `wchar_t`, but the encodings of both `char` and `wchar_t` are implementation details. The standard API does not include functions for transcoding to/from encodings outside the multibyte and wide execution sets.
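Within those execution sets, the standard `mbtowc` function performs the multibyte-to-wide conversion. A minimal sketch - the result depends entirely on the active locale, so here I assume a UTF-8 locale and feed in the euro sign's UTF-8 bytes rather than the Windows-949 example above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, ""); // adopt the environment's locale
    const char mb[] = "\xE2\x82\xAC"; // euro sign in UTF-8 (assumed locale)
    wchar_t wc;
    // mbtowc consumes one multibyte character and yields one wchar_t;
    // it fails if the bytes are invalid in the current locale.
    int consumed = mbtowc(&wc, mb, sizeof mb - 1);
    if (consumed > 0)
        printf("%d byte(s) -> one wide character, value 0x%lX\n",
               consumed, (unsigned long) wc);
    else
        printf("sequence not valid in the current locale\n");
    return 0;
}
```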
Writing portable transcoding code requires the use of a third-party library such as ICU4C.
A trivial transcoding operation in C:
```c
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include "unicode/utypes.h"
#include "unicode/ucnv.h"
#include "unicode/ucnv_err.h"

void assertOk(UErrorCode status) {
    if (U_FAILURE(status)) {
        fprintf(stderr, "err == %s\n", u_errorName(status));
    }
    assert(U_SUCCESS(status));
}

int main(int argc, char *argv[]) {
    char iso8859_15[] = { (char) 0xA4, 0x0 }; // euro sign

    // convert to UTF-16
    UErrorCode status = U_ZERO_ERROR;
    UConverter *encoding = ucnv_open("ISO-8859-15", &status);
    assertOk(status);
    int len16 = ucnv_toUChars(encoding, NULL, 0, iso8859_15, -1, &status);
    status = (status == U_BUFFER_OVERFLOW_ERROR) ? U_ZERO_ERROR : status;
    assertOk(status);
    UChar *utf16 = (UChar *) malloc(len16 * sizeof(UChar));
    ucnv_toUChars(encoding, utf16, len16, iso8859_15, -1, &status);
    assertOk(status);
    ucnv_close(encoding);

    // convert to UTF-8
    encoding = ucnv_open("UTF-8", &status);
    assertOk(status);
    int len8 = ucnv_fromUChars(encoding, NULL, 0, utf16, len16, &status);
    status = (status == U_BUFFER_OVERFLOW_ERROR) ? U_ZERO_ERROR : status;
    assertOk(status);
    char *utf8 = (char *) malloc(len8 * sizeof(char));
    ucnv_fromUChars(encoding, utf8, len8, utf16, len16, &status);
    assertOk(status);
    ucnv_close(encoding);

    // print resultant bytes
    for (int i = 0; i < len8; i++) {
        printf("%02x\n", (unsigned char) utf8[i]);
    }

    // clean up
    free(utf16);
    free(utf8);
    return 0;
}
```
This code uses the ICU4C library to convert null-terminated ISO-8859-15 encoded bytes first to UTF-16 (as type `UChar`), then from UTF-16 to UTF-8 encoded bytes. The `UChar` type is provided by ICU4C.
This information pertains to the C99 standard with ICU4C version 4.2.
- JTC1/SC22/WG14 - C (C standard working group)
- ICU - International Components for Unicode
C#
Character data is always encoded as UTF-16 in C# and the `char` type is always 16 bits wide. Other encodings must be represented using the 8 bit `byte` type. This means that many I/O operations that read or write character data perform implicit transcoding.
A trivial transcoding operation in C#:
```csharp
using System;
using System.Text;

public class SharpTranscode {
    public static void Main() {
        byte[] iso8859_15 = { 0xA4 }; // euro sign
        char[] utf16 = Encoding.GetEncoding("ISO-8859-15").GetChars(iso8859_15);
        byte[] utf8 = Encoding.UTF8.GetBytes(utf16);
        foreach (byte b in utf8)
            Console.WriteLine(String.Format("{0:x2}", b));
    }
}
```
The `Encoding` class also comes with a static `Convert` method if you don't need the intermediate character data.
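A minimal sketch of the same conversion using `Encoding.Convert`, byte-to-byte with no intermediate `char[]` (the class name is illustrative):

```csharp
using System;
using System.Text;

public class SharpConvert {
    public static void Main() {
        byte[] iso8859_15 = { 0xA4 }; // euro sign
        // Transcode directly between byte encodings.
        byte[] utf8 = Encoding.Convert(
            Encoding.GetEncoding("ISO-8859-15"), Encoding.UTF8, iso8859_15);
        foreach (byte b in utf8)
            Console.WriteLine("{0:x2}", b);
    }
}
```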
Examples were compiled using Mono 2.6.1.
Java
Like C#, Java uses a 16 bit `char` type with an implicit encoding of UTF-16, and the 8 bit `byte` type should be used for other encodings.
A trivial transcoding operation in Java:
```java
import java.nio.charset.Charset;

public class JavaTranscode {
    public static void main(String[] args) {
        byte[] iso8859_15 = { (byte) 0xA4 }; // euro sign
        String utf16 = new String(iso8859_15, Charset.forName("ISO-8859-15"));
        byte[] utf8 = utf16.getBytes(Charset.forName("UTF-8"));
        for (byte b : utf8)
            System.out.format("%02x%n", b);
    }
}
```
The Java version referenced here is 1.6.
- The Java Language Specification
- The Java Virtual Machine Specification
- JDK 6 Documentation
- Character class
- String class
- java.nio.charset package
- Supported Encodings
Python
Strings in Python come in two types - `str` and `unicode`. For example, the literal `"foo"` is of type `str` while the literal `u"bar"` is of type `unicode`. Like C, Python uses its character type for byte operations.
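A minimal illustration of the two types (Python 2):

```python
# str holds bytes; unicode holds code points (Python 2).
s = "foo"
u = u"bar"
print type(s)  # <type 'str'>
print type(u)  # <type 'unicode'>
# Encoding a unicode value yields a byte-oriented str.
print repr(u"\u20AC".encode("UTF-8"))  # '\xe2\x82\xac'
```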
A trivial transcoding operation in Python:
```python
import array

bytesIso8859_15 = array.array('B', [0xA4])  # euro sign
ucs = bytesIso8859_15.tostring().decode("ISO-8859-15")
utf8 = ucs.encode("UTF-8")
for b in utf8:
    print "%02x" % ord(b)
```
The above code starts with an ISO-8859-15 encoded octet sequence that is transformed to a Unicode string before being transformed to a UTF-8 encoded byte sequence. You could write the first transformation more succinctly as `ucs = "\xA4".decode("ISO-8859-15")`, but I want to emphasise that the byte data may not necessarily be suitable for use in string manipulation.
Depending on how the Python runtime was compiled, Unicode strings may be backed by 16 or 32 bit "code units." This has consequences for Unicode code points outside the basic multilingual plane (i.e. those with a value higher than 0xFFFF). The code `print(len(u"\U0001D50A"))` will print `1` on some implementations (e.g. the Ubuntu 9.10 Python implementation) and `2` on others (e.g. the Windows Python install).
Since Python supports supplementary code points like U+1D50A in its string literals, support for 16 bit code units should probably be described as UTF-16 instead of UCS2.
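A minimal sketch of the difference (Python 2; the code unit values in the comments assume a narrow, 16 bit build):

```python
# On a narrow build, U+1D50A is stored as a UTF-16 surrogate pair of
# two 16 bit code units; on a wide build it is a single code unit.
s = u"\U0001D50A"
print len(s)  # 2 on narrow builds, 1 on wide builds
for u in s:
    print "%04x" % ord(u)  # narrow: d835 then dd0a; wide: 1d50a
```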
The Python version described here is 2.6.4.
- The Python Language Reference
- Lexical analysis: Literals
- The Python Standard Library
Note: among the incompatible changes in Python 3 are changes to string literals.
Ruby
In Ruby, the `String` type is an octet sequence with encoding metadata attached. For example, the code `puts "\u20AC".encoding` prints the encoding of a string literal. Since strings can be of different encodings, care is needed to handle character data consistently - for example, `"\u20AC".encode("UTF-8") == "\u20AC".encode("ISO-8859-15")` evaluates to false.
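A minimal sketch of that behaviour - the same character carries different bytes and different encoding metadata depending on how it was encoded:

```ruby
# The same code point in two encodings: neither the bytes nor the
# attached encodings match, so the strings compare unequal.
a = "\u20AC".encode("UTF-8")
b = "\u20AC".encode("ISO-8859-15")
puts a.encoding                               # UTF-8
puts b.encoding                               # ISO-8859-15
puts a == b                                   # false
puts a.bytes.map { |x| "%02x" % x }.join(" ") # e2 82 ac
puts b.bytes.map { |x| "%02x" % x }.join(" ") # a4
```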
A trivial transcoding operation in Ruby:
```ruby
bytesIso8859_15 = [ 0xA4 ] # euro sign
utf8 = bytesIso8859_15.pack('c*').encode("UTF-8", "ISO-8859-15")
utf8.each_byte { |b| puts "%02x" % b }
```
The above code transforms ISO-8859-15 encoded bytes to a UTF-8 encoded string. Since Ruby does not use an implicit character encoding, there is no obvious intermediate encoding step (though the API uses UTF-8 as an intermediate internally).
The Ruby version described here is MRI 1.9.
Encoding support in Ruby 1.8 was more primitive and some of the information presented here does not apply.
Notes
In the trivial examples, string classes have been used to perform encoding operations. This is fine for complete data, but can result in character corruption when working with data streams, because a multibyte sequence may be split across two reads. When transcoding data from a stream, it is generally better to use a transcoding stream type like `UFILE` (ICU4C), `TextReader` (C#/.Net), `Reader` (Java), `TextIOBase` (Python), or to set the encoding mode (Ruby); see the sketch below. Similar mechanisms exist for output.
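For instance, a minimal Java sketch of stream transcoding - the `Reader` decodes incrementally, so multibyte sequences split across buffer boundaries are still handled correctly:

```java
import java.io.*;
import java.nio.charset.Charset;

public class StreamTranscode {
    public static void main(String[] args) throws IOException {
        byte[] iso8859_15 = { (byte) 0xA4 }; // euro sign
        // Decode ISO-8859-15 bytes as they are read...
        Reader in = new InputStreamReader(
            new ByteArrayInputStream(iso8859_15),
            Charset.forName("ISO-8859-15"));
        // ...and encode to UTF-8 as characters are written.
        Writer out = new OutputStreamWriter(System.out,
            Charset.forName("UTF-8"));
        int c;
        while ((c = in.read()) != -1)
            out.write(c);
        out.flush();
    }
}
```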
Comments

I always had this question: is a wchar_t (Unicode) in Windows UTF-16? All the documentation refers to wchar_t as a fixed size of 2 bytes, which looks more like UCS2, since UTF-16 is variable size (2-4 bytes).
In UTF-16, a code unit is 16 bits (2 bytes). Unicode code points (or characters) are composed of one or two code units. The code unit values 0xD800-0xDBFF and 0xDC00-0xDFFF are reserved to form surrogate pairs (4 byte sequences).
UTF-16 is a variable-width encoding (like UTF-8 or some of the legacy Asian character sets).
UCS2 is restricted to the single code unit characters. In UTF-16, this subset is known as the Basic Multilingual Plane.
The upshot of this is that a single Unicode code point (character) is not equivalent to a single code unit. To represent the full Unicode range in single code units, you need to use UTF-32.
More details in the Unicode FAQ on UTF-16.
The upshot is that the number of wchar_ts in a string may not equal the number of characters (but then the same is often true for char too).
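A minimal Java sketch of the code-unit/code-point distinction (Java strings are UTF-16):

```java
public class Surrogates {
    public static void main(String[] args) {
        // U+1D50A is outside the BMP, so it needs a surrogate pair.
        String s = new String(Character.toChars(0x1D50A));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        for (char c : s.toCharArray())
            System.out.format("%04x%n", (int) c);            // d835, dd0a
    }
}
```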