Friday, 10 April 2009

Java: Unicode on the Windows command line

By default, Java encodes Strings sent to System.out in the default code page. On Windows XP, this means a lossy conversion to an "ANSI" code page. This is unfortunate, because the Windows Command Prompt (cmd.exe) can read and write Unicode characters. This post describes how to use JNA to work round this problem.

This post is a follow-up to I18N: Unicode at the Windows command prompt (C++; .Net; Java), so you might want to read that first.

To get this code to support Unicode on Windows XP, you'll need to switch your console to a Unicode font (e.g. Lucida Console on the English language version).
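The lossy conversion mentioned above can be reproduced without a console. The following is a minimal sketch, assuming the JVM ships the windows-1252 charset (a common "ANSI" code page on English-language Windows); the class and method names are mine and are not part of the project sources.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class LossyAnsiDemo {
  /** Encodes text in the named charset; String.getBytes substitutes
   *  the charset's replacement byte for unmappable characters. */
  static byte[] encode(String text, String charsetName) {
    return text.getBytes(Charset.forName(charsetName));
  }

  public static void main(String[] args) {
    String text = "\u00A3\u044F"; // pound sign, Cyrillic ya
    byte[] ansi = encode(text, "windows-1252");
    // The pound sign has a windows-1252 mapping; the Cyrillic letter
    // does not, so it is replaced during encoding.
    System.out.println(Arrays.toString(ansi));
  }
}
```

The second byte comes back as the code for a question mark, which is the same kind of substitution you see when System.out meets a character outside the default code page.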

JNA

The Java Native Access API allows you to call native APIs without needing to resort to writing/compiling native interface code using JNI. You can just write everything in Java.

It isn't all a walk in the park, though. You'll need to understand the API you're calling and understand the mapping between Java and native types. Prior C/C++ experience is a definite plus. For native constant values, you may need to either read API header files or write a short application that emits them.
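As an illustration of the constants problem, here is a sketch of how a few values from the Windows headers might be carried over to Java. The class name Win32Constants and the asDword helper are hypothetical; the numeric values are the ones documented for GetStdHandle and the code page identifiers.

```java
final class Win32Constants {
  // The header defines these as ((DWORD)-10) and so on, so the bit
  // pattern fits a Java int unchanged.
  static final int STD_INPUT_HANDLE = -10;
  static final int STD_OUTPUT_HANDLE = -11;
  static final int STD_ERROR_HANDLE = -12;

  // Code page identifiers.
  static final int CP_ACP = 0;
  static final int CP_UTF8 = 65001;

  /** Formats a Java int as the unsigned 32-bit value native code sees. */
  static String asDword(int value) {
    return String.format("0x%08X", value);
  }

  public static void main(String[] args) {
    System.out.println("STD_OUTPUT_HANDLE as DWORD: "
        + asDword(STD_OUTPUT_HANDLE));
  }
}
```

Java has no unsigned 32-bit type, so a DWORD is usually mapped to int and only the interpretation of the sign bit differs.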

Here is an example mapping for WriteConsole:

C++ declaration from API doc:

  BOOL WINAPI WriteConsole(
    __in        HANDLE hConsoleOutput,
    __in        const VOID *lpBuffer,
    __in        DWORD nNumberOfCharsToWrite,
    __out       LPDWORD lpNumberOfCharsWritten,
    __reserved  LPVOID lpReserved
  );

Java interface declaration:

  public boolean WriteConsoleW(
      Pointer hConsoleOutput,
      char[] lpBuffer,
      int nNumberOfCharsToWrite,
      IntByReference lpNumberOfCharsWritten,
      Pointer lpReserved
  );

The C++ API offers three ways to call the function. We opt for WriteConsoleW because we explicitly want to use Unicode (16-bit C++ wchar_t types). The alternative is to use an "ANSI" call (WriteConsoleA, requiring Java byte arrays instead of Java char arrays) or pass options to the library initialisation to declare which one WriteConsole should delegate to.
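Because a Java char is a 16-bit UTF-16 code unit, just like wchar_t on Windows, the length of the char array is the count that belongs in nNumberOfCharsToWrite. The sketch below (class and method names are hypothetical) shows that this count is in code units rather than code points.

```java
final class Utf16Units {
  /** Number of 16-bit units a wide-character call would be given. */
  static int wideCharCount(String s) {
    return s.toCharArray().length;
  }

  /** Number of Unicode code points in the same string. */
  static int codePointCount(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String ya = "\u044F"; // inside the Basic Multilingual Plane
    String fraktur = new String(Character.toChars(0x1D504)); // outside it
    System.out.println(wideCharCount(ya) + " unit(s), "
        + codePointCount(ya) + " code point(s)");
    System.out.println(wideCharCount(fraktur) + " unit(s), "
        + codePointCount(fraktur) + " code point(s)");
  }
}
```

A character outside the Basic Multilingual Plane occupies two array elements (a surrogate pair), so it counts as two towards the length passed to the native call.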

The mappings I've used are not the only ones that will work. The JNA code samples show how to create a HANDLE type to use instead of Pointer, for example.

Once you've defined an interface with the appropriate mapping methods, creating an instance is trivial:

  public static Kernel32 INSTANCE = (Kernel32) Native
      .loadLibrary("Kernel32", Kernel32.class);

Note: all the examples below are single-threaded and do not synchronise access to the native API. JNA is licensed under the LGPL, which may not suit everyone. The examples that follow should still be useful in understanding the requirements for a JNI implementation.

Getting the console mode

The console mode defines the screen buffer's input/output modes (e.g. whether it is using insert or overwrite input). These examples don't change these modes, but it is still useful to call this function: it returns false if the handle has been redirected (e.g. if stdout is piped to a file).

  private static final Kernel32 KERNEL32 = Kernel32.INSTANCE;

  public static void main(String[] args) {
    Pointer hConsoleHandle = KERNEL32
        .GetStdHandle(Kernel32.STD_OUTPUT_HANDLE);
    IntByReference lpMode = new IntByReference();
    if (KERNEL32.GetConsoleMode(hConsoleHandle, lpMode)) {
      System.out.println("Console mode: "
          + lpMode.getValue());
    } else {
      System.err.println("Can't get console mode; reason: "
          + W32Utils.getLastError());
    }
  }
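The mode value is a set of bit flags. As a rough sketch, the two long-standing output flags can be decoded like this; the class and method names are hypothetical, and the flag values are the ones documented for GetConsoleMode on an output handle.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsoleModeFlags {
  static final int ENABLE_PROCESSED_OUTPUT = 0x0001;
  static final int ENABLE_WRAP_AT_EOL_OUTPUT = 0x0002;

  /** Names the output-mode flags that are set in the given mode value. */
  static List<String> describeOutputMode(int mode) {
    List<String> names = new ArrayList<String>();
    if ((mode & ENABLE_PROCESSED_OUTPUT) != 0) {
      names.add("ENABLE_PROCESSED_OUTPUT");
    }
    if ((mode & ENABLE_WRAP_AT_EOL_OUTPUT) != 0) {
      names.add("ENABLE_WRAP_AT_EOL_OUTPUT");
    }
    // Report any bits this sketch does not know about.
    int unknown = mode
        & ~(ENABLE_PROCESSED_OUTPUT | ENABLE_WRAP_AT_EOL_OUTPUT);
    if (unknown != 0) {
      names.add(String.format("unrecognised bits 0x%X", unknown));
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(describeOutputMode(3));
  }
}
```

A mode of 3 therefore means both flags are set. Input handles use a different set of flags, which this sketch does not cover.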

Setting and getting the console code page

The console code page can be changed to match the input/output encodings used by Java. Values are Windows code page identifiers. Equivalent get functions can read existing values. In some circumstances, like invoking the function in the absence of a console, you might read a value of zero. It is probably no coincidence that this is also the value of the constant CP_ACP (system default Windows ANSI code page).

  private static void setCodePage(String cp) {
    int wCodePageID;
    try {
      wCodePageID = Integer.parseInt(cp);
    } catch (NumberFormatException e) {
      System.err
          .println("Value must be MS code page identifier");
      return;
    }

    if (!KERNEL32.SetConsoleCP(wCodePageID)) {
      printLastError();
      return;
    }

    if (!KERNEL32.SetConsoleOutputCP(wCodePageID)) {
      printLastError();
      return;
    }

    reportCodePage();
    System.out.println("Changed to CP:\t" + wCodePageID);
  }

Writing Unicode to the console

Writing Unicode to the console is simple enough (assuming you've remembered to switch to a Unicode font and the font includes the graphemes you want to display). It doesn't matter which output code page has been set - that information is only required when working with "ANSI"/multibyte characters.

  public static void writeToConsole(Pointer hConsoleOutput,
      String message) {
    char[] lpBuffer = message.toCharArray();
    IntByReference lpNumberOfCharsWritten = new IntByReference();
    if (!KERNEL32.WriteConsoleW(hConsoleOutput, lpBuffer,
        lpBuffer.length, lpNumberOfCharsWritten, null)) {
      String error = W32Utils.getLastError();
      throw new IllegalStateException(error);
    }
  }

Working with "ANSI"/multibyte characters is something you will have to think about. If the output is redirected to a file, you can't use the WriteConsole function. You will need to test the console mode.

  private static void print(int nStdHandle, String message) {
    Pointer hConsoleOutput = KERNEL32
        .GetStdHandle(nStdHandle);
    IntByReference lpMode = new IntByReference();
    if (KERNEL32.GetConsoleMode(hConsoleOutput, lpMode)) {
      W32Utils.writeToConsole(hConsoleOutput, message);
    } else {
      int codePage = Kernel32.CP_ACP; // default system ANSI CP
      W32Utils.writeToFile(hConsoleOutput, codePage,
          message);
    }
  }

  /** instead of System.out.print */
  private static void stdout(String message) {
    print(Kernel32.STD_OUTPUT_HANDLE, message);
  }

There are a number of circumstances in which WriteConsole cannot be used to write to the console. If you run this code under the Eclipse IDE, for example, GetConsoleMode returns false.

Reading Unicode from the console

The ReadConsole function can be used to get Unicode input. The default console mode will let the user enter characters until the ENTER key is pressed. On Windows, line terminators are marked by a carriage return followed by a linefeed (\r\n).

  public static String readFromConsole(Pointer hConsoleInput) {
    StringBuilder line = new StringBuilder();
    char[] lpBuffer = new char[128];
    IntByReference lpNumberOfCharsRead = new IntByReference();
    while (true) {
      if (!Kernel32.INSTANCE.ReadConsoleW(hConsoleInput,
          lpBuffer, lpBuffer.length, lpNumberOfCharsRead,
          null)) {
        String errMsg = getLastError();
        throw new IllegalStateException(errMsg);
      }
      int len = lpNumberOfCharsRead.getValue();
      line.append(lpBuffer, 0, len);
      if (lpBuffer[len - 1] == '\n') {
        break;
      }
    }
    return line.toString();
  }

Again, the ReadConsole function can only be used if GetConsoleMode returns true.
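The string returned by readFromConsole still ends with the line terminator. If you want the bare line, something like the following would do; this is a small sketch, and the helper name is hypothetical.

```java
final class LineEndings {
  /** Removes one trailing "\r\n" (or a lone "\n") if present. */
  static String stripLineTerminator(String line) {
    if (line.endsWith("\r\n")) {
      return line.substring(0, line.length() - 2);
    }
    if (line.endsWith("\n")) {
      return line.substring(0, line.length() - 1);
    }
    return line;
  }

  public static void main(String[] args) {
    String raw = "\u044F\r\n";
    System.out.println(stripLineTerminator(raw).length());
  }
}
```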

Encoding to and decoding from the console code page

Encoding/decoding between wide chars and bytes could be done in Java, but that would require mapping the code page identifiers to Java encodings. It is more convenient to pass the code page to the native function.

  public static byte[] encode(int codePage, String message) {
    char[] lpWideCharStr = message.toCharArray();
    int lenA = KERNEL32.WideCharToMultiByte(codePage, 0,
        lpWideCharStr, lpWideCharStr.length, null, 0, null,
        null);
    byte[] lpMultiByteStr = new byte[lenA];
    KERNEL32.WideCharToMultiByte(codePage, 0,
        lpWideCharStr, lpWideCharStr.length,
        lpMultiByteStr, lpMultiByteStr.length, null, null);
    return lpMultiByteStr;
  }
  public static String decode(int codePage,
      byte[] lpMultiByteStr) {
    int lenW = KERNEL32.MultiByteToWideChar(codePage, 0,
        lpMultiByteStr, lpMultiByteStr.length, null, 0);
    char[] lpWideCharStr = new char[lenW];
    KERNEL32.MultiByteToWideChar(codePage, 0,
        lpMultiByteStr, lpMultiByteStr.length,
        lpWideCharStr, lpWideCharStr.length);
    return new String(lpWideCharStr);
  }

The functions are invoked twice: the first time to calculate the output buffer size; the second to fill the buffer.
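For comparison, the pure Java route needs a lookup from code page identifier to charset. The sketch below guesses at charset names using common naming patterns; it is a heuristic of my own, not a complete mapping, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class CodePageCharsets {
  /** Returns a Java charset for a Windows code page identifier,
   *  or null if none of the guessed names is supported. */
  static Charset forCodePage(int codePage) {
    if (codePage == 65001) {
      return Charset.forName("UTF-8");
    }
    // Guess at the usual naming patterns; this will miss code pages
    // whose Java names do not follow them.
    String[] candidates = { "windows-" + codePage,
        "IBM" + codePage, "Cp" + codePage };
    for (String name : candidates) {
      if (Charset.isSupported(name)) {
        return Charset.forName(name);
      }
    }
    return null;
  }

  /** Encodes a message for the given code page using Java charsets. */
  static byte[] encode(int codePage, String message) {
    Charset charset = forCodePage(codePage);
    if (charset == null) {
      throw new IllegalArgumentException(
          "No Java charset found for code page " + codePage);
    }
    return message.getBytes(charset);
  }

  public static void main(String[] args) {
    System.out.println(forCodePage(1252));
    System.out.println(forCodePage(65001));
  }
}
```

The gaps in a lookup like this are the reason it is more convenient to hand the identifier to the native function.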

Printing characters as UTF-8

One other way to get the console to emit Unicode characters is to set its code page to UTF-8.

  private static final int UTF_8 = 65001;

  public static void main(String[] args) {
    int oldOutputCodePage = KERNEL32.GetConsoleOutputCP();
    try {
      KERNEL32.SetConsoleOutputCP(UTF_8);

      Pointer stdout = KERNEL32
          .GetStdHandle(Kernel32.STD_OUTPUT_HANDLE);

      W32Utils.writeToFile(stdout, UTF_8,
          "Cyrillic Ya: \u044F\r\n");
    } finally {
      KERNEL32.SetConsoleOutputCP(oldOutputCodePage);
    }
  }

You can then encode and emit the bytes, treating the console handle like a file handle.

  public static void writeToFile(Pointer hFile,
      int codePage, String message) {
    byte[] lpBuffer = encode(codePage, message);
    IntByReference written = new IntByReference();
    if (!KERNEL32.WriteFile(hFile, lpBuffer,
        lpBuffer.length, written, null)) {
      String error = W32Utils.getLastError();
      throw new IllegalStateException(error);
    }
  }
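The encoding half of this can also be done in plain Java by wrapping a stream in a PrintStream with an explicit charset. This swaps the native WideCharToMultiByte/WriteFile calls for java.io, so treat it as a sketch of the byte-level result rather than a tested replacement; pointing it at the real console (for example via a FileOutputStream on FileDescriptor.out) is an assumption I have not verified against cmd.exe. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Print {
  /** Wraps a stream so that printed text is encoded as UTF-8. */
  static PrintStream utf8Stream(OutputStream out) {
    try {
      return new PrintStream(out, true, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // Every Java platform is required to support UTF-8.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Capture the bytes in memory so they can be inspected.
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    PrintStream ps = utf8Stream(sink);
    ps.print("\u044F");
    ps.flush();
    for (byte b : sink.toByteArray()) {
      System.out.printf("%02X ", b);
    }
    System.out.println();
  }
}
```

The Cyrillic letter becomes the two bytes D1 8F, which is what the console expects to receive once its output code page is 65001.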

It doesn't appear to be possible to read Unicode characters in UTF-8 mode (i.e. after setting the input code page to 65001 with SetConsoleCP). This appears to be a limitation of cmd.exe (it doesn't work with the .Net Console.ReadLine method either).

Code

All the sources are available in a public Subversion repository.

Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: Win32UnicodeConsole

Notes

Because the code will usually operate differently when run from an IDE, you may want to look into remote debugging.

5 comments:

  1. Find funny class sun.misc.SharedSecrets - looks like via sun.misc.SharedSecrets.getJavaIOAccess().charset() we can get real console codepage. Also see java.io.Console - new class, which appeared at java 1.6

  2. Interesting! I hadn't realized that java.io.Console did anything different to System.out.

    String test = "\u00A3\u044F";
    System.console().printf("%s%n", test);
    System.out.println(test);

    The above code, using a default cmd.exe on English Windows (CP850) will print U+00A3 (£) correctly via System.console() but not System.out.

    Unfortunately, it doesn't help us with U+044F (я) [after switching to a TrueType font]. It lies outside both my native OEM and ANSI character sets (850/1252). java.io.Console doesn't seem to like the console being switched to UTF-8 (chcp 65001) either.

    Still, this is good to know.

  3. You are right. java.io.Console is not new-cool-really-utf8-enabled console - this is just more useful wrapper instead of System.out.
    And this class can correctly determine underlying codepage. So if we set console codepage to 65001 - we can output not only "я", but even "我" (of course, if we have correct chinese font :) ).
    But all this cool if we have java 1.6 :)
    I had f*cking bug when we need to output messages in different languages to this f*cking windows cmd.exe (linux bash too, but it had less problems - $LANG rocks), so I was forced to implement similar native code for determining f*cking OEM codepage.

  4. FWIW runtests.bat fails with JDK7.

    e:\src\Win32UnicodeConsole>runtests.bat
    INFO: showing code page
    Input CP: 437
    Output CP: 437
    INFO: trying to set invalid CP
    The parameter is incorrect.

    INFO: setting code page to 1252
    Input CP: 1252
    Output CP: 1252
    Changed to CP: 1252
    Press any key to continue . . .
    INFO: testing console mode
    Console mode: 3
    INFO: redirecting stdout and testing mode
    Can't get console mode; reason: The handle is invalid.

  5. JNA has moved to https://github.com/twall/jna ...
