Saturday 17 April 2010

I18N: comparing character encoding in C, C#, Java, Python and Ruby

Don't assume that the character handling conventions you've learnt in one language/platform will automatically apply in others. I've selected a cross-section of popular languages to contrast the different ways character encoding is handled.

Character data types

Character data is always going to be in some form of encoding, whether on disk or in RAM. Here's a tabular comparison of some character types in different languages.

Language  Type     Width (bits)                   Implicit Encoding
--------  -------  -----------------------------  -----------------------
C         char     8                              implementation specific
          wchar_t  implementation specific (8+)   implementation specific
C#        char     16                             UTF-16
          string   16 (char sequence)             UTF-16
Java      char     16                             UTF-16
          String   16 (char sequence)             UTF-16
Python    str      8 (octet sequence)             ASCII (can be changed)
          unicode  16 or 32 (code unit sequence)  UCS2 (16) or UCS4 (32)
Ruby      String   8 (octet sequence)             none

Transcoding

Transcoding refers to the act of transforming data from one encoding to another. For example, the code point U+20AC (€) might be transcoded from the byte sequence A4 (ISO-8859-15) to 80 (Windows-1252) or E2 82 AC (UTF-8) or 20 AC (UTF-16BE) or vice-versa.
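
For instance, these byte sequences can be checked interactively in Python 2 (covered in more detail below):

>>> u"\u20AC".encode("ISO-8859-15")
'\xa4'
>>> u"\u20AC".encode("windows-1252")
'\x80'
>>> u"\u20AC".encode("UTF-8")
'\xe2\x82\xac'
>>> u"\u20AC".encode("UTF-16BE")
' \xac'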

In order to compare or manipulate strings, it is necessary to transform character data to a common format. This is usually done by defining a common encoding, together with the means to encode and decode between that encoding and others.

Underlying encoding issues are identical for all languages; how they are handled in language features and library implementations varies.

All the sample transcoding applications that follow print the hexadecimal byte representation of the euro sign (€) encoded as UTF-8:

e2
82
ac

C

It is tricky to write a truly general description of character handling in C because so much is left to the implementation and the developer. The standard library confines itself to the execution character set. The encoding of char data is that of the environment in which the program executes, but the type also doubles as a type for working with arbitrary byte (octet) data. This is likely one source of the confusion that leads developers to assume that one character equals one byte.

C99 requires char to span at least the range -127 to 127 (when signed) or 0 to 255 (when unsigned), and CHAR_BIT to be at least 8. That is, char is at least 8 bits wide; on virtually all modern platforms it is exactly 8.

Multibyte string literals of type char are defined with quotes ("foo") whereas wide string literals of type wchar_t are defined with an L prefix (L"bar"). Both literal types support hex escape sequences ("\xFF") and Unicode escape sequences ("\uFFFF" and "\UFFFFFFFF"), though how the Unicode escapes are encoded in the resultant byte sequences is implementation dependent.
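
To make that concrete, here is a minimal sketch of the literal forms (the exact bytes produced for the Unicode escapes are implementation dependent):

char multibyte[] = "foo";      /* multibyte string literal */
char hex[] = "\xA4";           /* hex escape: the single byte 0xA4 */
wchar_t wide[] = L"bar";       /* wide string literal */
wchar_t euro[] = L"\u20AC";    /* Unicode escape; resulting value is implementation defined */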

The wide character type wchar_t is typically 16 or 32 bits wide and is usually used to represent Unicode characters (though it doesn't have to be). From C99:

The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the wide character corresponding to that multibyte character, as defined by the mbtowc function, with an implementation-defined current locale.

The intent of wchar_t is to represent a multibyte character as a single value. For example, in the Windows-949 encoding, the Hangul character U+D16E (텮) is represented by the byte sequence B7 41, which requires two chars; converting to wchar_t would generally allow it to be represented as a single wchar_t. The encodings of both char and wchar_t remain implementation details, and the standard API does not include functions for transcoding to/from encodings outside the multibyte and wide execution sets.
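
As a sketch, decoding one multibyte character to a wchar_t with the standard mbtowc function might look like this (assuming the byte sequence is valid in the environment's locale encoding):

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
  wchar_t wc;
  const char *mb = "\xA4";    /* meaning depends on the locale encoding */
  setlocale(LC_ALL, "");      /* adopt the environment's locale */
  int len = mbtowc(&wc, mb, MB_CUR_MAX); /* decode one multibyte character */
  if (len > 0)
    printf("consumed %d byte(s), wide value 0x%lX\n", len, (unsigned long) wc);
  return 0;
}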

Writing portable transcoding code requires the use of a third-party library such as ICU4C.

A trivial transcoding operation in C:

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include "unicode/utypes.h"
#include "unicode/ucnv.h"
#include "unicode/ucnv_err.h"

void assertOk(UErrorCode status)
{
  if(U_FAILURE(status))
  {
    fprintf(stderr, "err == %s\n", u_errorName(status));
  }
  assert(U_SUCCESS(status));
}

int main(int argc, char *argv[])
{
  char iso8859_15[] = { (char) 0xA4, 0x0 }; // euro sign
  // convert to UTF-16
  UErrorCode status = U_ZERO_ERROR;
  UConverter *encoding = ucnv_open("ISO-8859-15", &status);
  assertOk(status);
  int len16 = ucnv_toUChars(encoding, NULL, 0, iso8859_15, -1, &status);
  status = (status == U_BUFFER_OVERFLOW_ERROR) ? U_ZERO_ERROR : status;
  assertOk(status);
  UChar *utf16 = (UChar*) malloc(len16 * sizeof(UChar));
  ucnv_toUChars(encoding, utf16, len16, iso8859_15, -1, &status);
  assertOk(status);
  ucnv_close(encoding);
  // convert to UTF-8
  encoding = ucnv_open("UTF-8", &status);
  assertOk(status);
  int len8 = ucnv_fromUChars(encoding, NULL, 0, utf16, len16, &status);
  status = (status == U_BUFFER_OVERFLOW_ERROR) ? U_ZERO_ERROR : status;
  assertOk(status);
  char *utf8 = (char*) malloc(len8 * sizeof(char));
  ucnv_fromUChars(encoding, utf8, len8, utf16, len16, &status);
  assertOk(status);
  ucnv_close(encoding);
  // print resultant bytes
  for(int i=0; i<len8; i++)
  {
    printf("%02x\n", (unsigned char) utf8[i]);
  }
  // clean up
  free(utf16);
  free(utf8);
  return 0;
}

This code uses the ICU4C library to convert null-terminated ISO-8859-15 encoded bytes first to UTF-16 (as the library's UChar type), then from UTF-16 to UTF-8 encoded bytes.

This information pertains to the C99 standard with ICU4C version 4.2.

C#

Character data is always encoded as UTF-16 in C#, and the char type is always 16 bits wide. Other encodings must be represented using the 8 bit byte type. As a consequence, many I/O operations that read or write character data perform implicit transcoding.

A trivial transcoding operation in C#:

using System;
using System.Text;
public class SharpTranscode
{
  public static void Main()
  {
    byte[] iso8859_15 = { 0xA4 }; // euro sign
    char[] utf16 = Encoding.GetEncoding("ISO-8859-15").GetChars(iso8859_15);
    byte[] utf8 = Encoding.UTF8.GetBytes(utf16);
    foreach(byte b in utf8)
      Console.WriteLine(String.Format("{0:x2}", b));
  }
}

The Encoding class also comes with a static Convert method if you don't need the intermediate character data.
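
For example, the following sketch should produce the same UTF-8 bytes without handling the intermediate char data yourself:

using System;
using System.Text;
public class SharpConvert
{
  public static void Main()
  {
    byte[] iso8859_15 = { 0xA4 }; // euro sign
    // convert directly between byte encodings
    byte[] utf8 = Encoding.Convert(
        Encoding.GetEncoding("ISO-8859-15"), Encoding.UTF8, iso8859_15);
    foreach(byte b in utf8)
      Console.WriteLine(String.Format("{0:x2}", b));
  }
}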

Examples were compiled using Mono 2.6.1.

Java

Like C#, Java uses a 16 bit char type with an implicit encoding of UTF-16; the 8 bit byte type should be used for other encodings.

A trivial transcoding operation in Java:

import java.nio.charset.Charset;
public class JavaTranscode {
  public static void main(String[] args) {
    byte[] iso8859_15 = { (byte) 0xA4 }; // euro sign
    String utf16 = new String(iso8859_15, Charset.forName("ISO-8859-15"));
    byte[] utf8 = utf16.getBytes(Charset.forName("UTF-8"));
    for (byte b : utf8)
      System.out.format("%02x%n", b);
  }
}

The Java version referenced here is 1.6.

Python

Strings in Python come in two types: str and unicode. For example, the literal "foo" is of type str while the literal u"bar" is of type unicode. Like C's char, the str type doubles as the type for byte operations.
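
An interactive session makes the distinction visible (note that str data is implicitly decoded when mixed with unicode):

>>> type("foo")
<type 'str'>
>>> type(u"bar")
<type 'unicode'>
>>> "foo" + u"bar"  # the str operand is decoded using the default (ASCII) codec
u'foobar'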

A trivial transcoding operation in Python:

import array
bytesIso8859_15 = array.array('B', [0xA4]) #euro sign
ucs = bytesIso8859_15.tostring().decode("ISO-8859-15")
utf8 = ucs.encode("UTF-8")
for b in utf8: print "%02x" % ord(b)

The above code starts with an ISO-8859-15 encoded octet sequence, transforms it to a Unicode string, and then transforms that to a UTF-8 encoded byte sequence. You could write the first transformation more succinctly as ucs = "\xA4".decode("ISO-8859-15"), but I want to emphasise that the byte data may not necessarily be suitable for use in string manipulation.

Depending on how the Python runtime was compiled, Unicode strings may be backed by 16 or 32 bit "code units". This has consequences for Unicode code points outside the basic multilingual plane (i.e. those with a value higher than 0xFFFF). The code print(len(u"\U0001D50A")) will print 1 on some implementations (e.g. the Ubuntu 9.10 Python build) and 2 on others (e.g. the Windows Python install).

Since Python supports supplementary code points like U+1D50A in its string literals, support for 16 bit code units should probably be described as UTF-16 instead of UCS2.

The Python version described here is 2.6.4.

Note: among the incompatible changes in Python 3 are changes to string literals.

Ruby

In Ruby, the String type is an octet sequence with encoding metadata attached. For example, the code puts "\u20AC".encoding prints the encoding of a string literal. Since strings can be of different encodings, care is needed to handle character data consistently - for example, "\u20AC".encode("UTF-8") == "\u20AC".encode("ISO-8859-15") evaluates to false.
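
A short illustration of that metadata (assuming a UTF-8 source file):

s = "\u20AC"             # euro sign; the literal is UTF-8 encoded here
puts s.encoding          # UTF-8
puts s.length            # 1 (character)
puts s.bytesize          # 3 (octets)
puts s.encode("ISO-8859-15").bytesize # 1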

A trivial transcoding operation in Ruby:

bytesIso8859_15 = [ 0xA4 ] # euro sign
utf8 = bytesIso8859_15.pack('C*').encode("UTF-8", "ISO-8859-15")
utf8.each_byte {|b| puts "%02x" % b }

The above code transforms ISO-8859-15 encoded bytes to a UTF-8 encoded string. Since Ruby does not use an implicit character encoding, there is no obvious intermediate encoding step (though the API uses UTF-8 as an intermediate internally).

The Ruby version described here is MRI 1.9.

Encoding support in Ruby 1.8 was more primitive and some of the information presented here does not apply.

Notes

In the trivial examples, string classes have been used to perform encoding operations. This is fine when all the data is available up front, but it can result in character corruption when working with data streams, because a multibyte sequence may be split across read boundaries. When transcoding data from a stream, it is generally better to use a transcoding stream type like UFILE (ICU4C), TextReader (C#/.Net), Reader (Java) or TextIOBase (Python), or to set the encoding on the stream (Ruby). Similar mechanisms exist for output.
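
As a sketch, stream transcoding in Python might look like this (in.txt and out.txt are hypothetical file names):

import io
# decode ISO-8859-15 text incrementally as the stream is consumed
with io.open("in.txt", "r", encoding="ISO-8859-15") as reader:
    ucs = reader.read()
# encode to UTF-8 on the way out
with io.open("out.txt", "w", encoding="UTF-8") as writer:
    writer.write(ucs)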

2 comments:

  1. I always had this question. Is wchar_t (or Unicode) in Windows UTF-16? All the documentation refers to wchar_t as a fixed size of 2 bytes, which looks more like UCS2, since UTF-16 is variable size (2-4 bytes).

  2. In UTF-16, a code unit is 16 bits (2 bytes). Unicode code points (characters) are encoded as one or two code units. The code unit values 0xD800-0xDBFF and 0xDC00-0xDFFF are reserved to form surrogate pairs (4 byte sequences).

    UTF-16 is a variable-width encoding (like UTF-8 or some of the legacy Asian character sets).

    UCS2 is restricted to characters representable in a single code unit; this subset of Unicode is known as the Basic Multilingual Plane.

    The upshot of this is that a single Unicode code point (character) is not equivalent to a single code unit. To represent the full Unicode range in single code units, you need to use UTF-32.

    More details in the Unicode FAQ on UTF-16.

    In practice, this means the number of wchar_t values in a string may not equal the number of characters (but then the same is often true for char too).
