By default, Java encodes Strings sent to System.out in the default code page. On Windows XP, this means a lossy conversion to an "ANSI" code page. This is unfortunate, because the Windows Command Prompt (cmd.exe) can read and write Unicode characters. This post describes how to use JNA to work round this problem.
This post is a follow-up to I18N: Unicode at the Windows command prompt (C++; .Net; Java), so you might want to read that first.
To get this code to support Unicode on Windows XP, you'll need to switch your console to a Unicode font (e.g. Lucida Console on the English language version).
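The lossy conversion mentioned above can be reproduced in plain Java without any native code. The following is a minimal sketch, assuming windows-1252 stands in for the "ANSI" code page; the class and method names are illustrative and not part of the project sources.

```java
import java.nio.charset.Charset;

final class LossyEncodingDemo {

    /** Encodes text to bytes in the given charset and decodes it back again. */
    static String roundTrip(String text, Charset charset) {
        // String.getBytes substitutes the charset's replacement bytes
        // (a question mark for windows-1252) for unmappable characters.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 plays the role of the "ANSI" code page.
        Charset ansi = Charset.forName("windows-1252");
        String test = "\u00A3\u044F"; // POUND SIGN, CYRILLIC SMALL LETTER YA
        String result = roundTrip(test, ansi);
        System.out.println(test.equals(result) ? "lossless" : "lossy");
    }
}
```

The pound sign survives the round trip because windows-1252 contains it; the Cyrillic letter does not, so it comes back as a question mark.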
JNA
The Java Native Access API allows you to call native APIs without needing to resort to writing/compiling native interface code using JNI. You can just write everything in Java.
It isn't all a walk in the park, though. You'll need to understand the API you're calling and understand the mapping between Java and native types. Prior C/C++ experience is a definite plus. For native constant values, you may need to either read API header files or write a short application that emits them.
- JNA documentation
- Java primitive types (and their object equivalents) map directly to the native C type of the same size
- MSDN Library: Win32 and COM Development
Here is an example mapping for WriteConsole:
C++ declaration from API doc | Java interface declaration |
---|---|
BOOL WINAPI WriteConsole( __in HANDLE hConsoleOutput, __in const VOID *lpBuffer, __in DWORD nNumberOfCharsToWrite, __out LPDWORD lpNumberOfCharsWritten, __reserved LPVOID lpReserved ); | public boolean WriteConsoleW( Pointer hConsoleOutput, char[] lpBuffer, int nNumberOfCharsToWrite, IntByReference lpNumberOfCharsWritten, Pointer lpReserved ); |
The C++ API offers three ways to call the function. We opt for WriteConsoleW because we explicitly want to use Unicode (16-bit C++ wchar_t types). The alternatives are to use the "ANSI" call (WriteConsoleA, requiring Java byte arrays instead of Java char arrays) or to pass options to the library initialisation to declare which one WriteConsole should delegate to.
The mappings I've used are not the only ones that will work. The JNA code samples show how to create a HANDLE type to use instead of Pointer, for example.
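As a rough illustration of why char[] and int fit the wide-character call, the sketch below prepares the buffer and count arguments in plain Java. It assumes the count passed as nNumberOfCharsToWrite is measured in 16-bit UTF-16 code units; the class and method names are hypothetical helpers, not JNA or project APIs.

```java
final class WideCharArguments {

    /** Buffer argument: one Java char per 16-bit wchar_t. */
    static char[] toWideChars(String message) {
        return message.toCharArray();
    }

    /** Count argument: UTF-16 code units, not Unicode code points. */
    static int wideCharCount(String message) {
        return message.length();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so it
        // occupies two chars (a surrogate pair) in the buffer.
        String message = "a\uD83D\uDE00";
        System.out.println("code units:  " + wideCharCount(message));
        System.out.println("code points: " + message.codePointCount(0, message.length()));
        System.out.println("bytes per char: " + Character.BYTES
                + ", bytes per int: " + Integer.BYTES);
    }
}
```

Passing the code point count instead of the code unit count would truncate any message containing supplementary characters.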
Once you've defined an interface with the appropriate mapping methods, creating an instance is trivial:
public static Kernel32 INSTANCE = (Kernel32) Native |
Note: all the examples below are single-threaded and do not synchronise access to the native API. JNA is licensed under the LGPL, which may not suit everyone. The examples that follow should still be useful in understanding the requirements for a JNI implementation.
Getting the console mode
The console mode defines the screen buffer's input/output modes (e.g. whether it is using insert or overwrite input). These examples don't change these modes, but it is still useful to call GetConsoleMode: it returns false if the handle is being redirected (e.g. if stdout is piped to a file).
private static final Kernel32 KERNEL32 = Kernel32.INSTANCE;
|
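The mode value written through the out parameter is a set of bit flags. The sketch below decodes the flags for an input handle in plain Java, assuming the values documented in the Windows console headers; output handles use a different set of flags, and the class name is illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsoleInputModeFlags {

    // Input-mode flag values as documented for GetConsoleMode/SetConsoleMode.
    private static final int[] VALUES = {
        0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080
    };
    private static final String[] NAMES = {
        "ENABLE_PROCESSED_INPUT", "ENABLE_LINE_INPUT", "ENABLE_ECHO_INPUT",
        "ENABLE_WINDOW_INPUT", "ENABLE_MOUSE_INPUT", "ENABLE_INSERT_MODE",
        "ENABLE_QUICK_EDIT_MODE", "ENABLE_EXTENDED_FLAGS"
    };

    /** Lists the names of the flags set in an input-handle console mode. */
    static List<String> describe(int mode) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < VALUES.length; i++) {
            if ((mode & VALUES[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        // 0x0020 is the insert/overwrite flag referred to in the text.
        System.out.println(describe(0x0020 | 0x0002));
    }
}
```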
Setting and getting the console code page
The console code page can be changed to match the input/output encodings used by Java. Values are Windows code page identifiers. Equivalent get functions can read existing values. In some circumstances, like invoking the function in the absence of a console, you might read a value of zero. It is probably no coincidence that this is also the value of the constant CP_ACP (system default Windows ANSI code page).
private static void setCodePage(String cp) {
|
Writing Unicode to the console
Writing Unicode to the console is simple enough (assuming you've remembered to switch to a Unicode font and the font includes the glyphs you want to display). It doesn't matter which output code page has been set - that information is only required when working with "ANSI"/multibyte characters.
public static void writeToConsole(Pointer hConsoleOutput,
|
Working with "ANSI"/multibyte characters is something you will have to think about. If the output is redirected to a file, you can't use the WriteConsole function. You will need to test the console mode.
private static void print(int nStdHandle, String message) {
|
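The decision described above (wide characters for a real console, encoded bytes for a redirected handle) can be sketched independently of the native binding. Everything here is hypothetical: NativeOutput is a stand-in interface rather than a JNA type, and the comments only suggest which native calls might sit behind each method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsolePrinter {

    /** Hypothetical stand-in for the native binding (not a JNA type). */
    interface NativeOutput {
        boolean isConsole();            // e.g. true when GetConsoleMode succeeds
        void writeWide(char[] chars);   // e.g. backed by WriteConsoleW
        void writeBytes(byte[] bytes);  // e.g. backed by a file-style write
    }

    /** Writes wide chars to a console, or encoded bytes to a redirected handle. */
    static void print(NativeOutput out, String message, Charset redirectCharset) {
        if (out.isConsole()) {
            out.writeWide(message.toCharArray());
        } else {
            out.writeBytes(message.getBytes(redirectCharset));
        }
    }

    /** In-memory fake so the dispatch logic can be exercised on any platform. */
    static final class RecordingOutput implements NativeOutput {
        final boolean console;
        String lastCall = "";

        RecordingOutput(boolean console) { this.console = console; }

        @Override public boolean isConsole() { return console; }
        @Override public void writeWide(char[] chars) { lastCall = "wide:" + chars.length; }
        @Override public void writeBytes(byte[] bytes) { lastCall = "bytes:" + bytes.length; }
    }

    public static void main(String[] args) {
        RecordingOutput redirected = new RecordingOutput(false);
        print(redirected, "\u00A3\u044F", StandardCharsets.UTF_8);
        System.out.println(redirected.lastCall);
    }
}
```

Keeping the native calls behind a small interface like this also makes it possible to test the fallback path from an IDE, where no real console is attached.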
There are a number of circumstances in which WriteConsole cannot be used. If you run this code under the Eclipse IDE, for example, GetConsoleMode returns false.
Reading Unicode from the console
The ReadConsole function can be used to get Unicode input. The default console mode will let the user enter characters until the ENTER key is pressed. On Windows, line terminators are marked by a carriage return followed by a linefeed (\r\n).
public static String readFromConsole(Pointer hConsoleInput) {
|
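Since the buffer filled by a line-mode read ends with the \r\n terminator, callers usually want it removed before using the text. This is a minimal sketch of that step, assuming the native call has already filled a char[] and reported how many chars it read; the helper name is illustrative.

```java
final class ConsoleLine {

    /**
     * Converts the first charsRead chars of buffer to a String,
     * dropping a trailing "\r\n" (or a lone "\n" or "\r") if present.
     */
    static String toLine(char[] buffer, int charsRead) {
        if (charsRead < 0 || charsRead > buffer.length) {
            throw new IllegalArgumentException("charsRead out of range: " + charsRead);
        }
        int end = charsRead;
        if (end > 0 && buffer[end - 1] == '\n') {
            end--;
        }
        if (end > 0 && buffer[end - 1] == '\r') {
            end--;
        }
        return new String(buffer, 0, end);
    }

    public static void main(String[] args) {
        // Simulates a buffer larger than the input, as a native read would use.
        char[] buffer = "hello\r\n\0\0\0".toCharArray();
        System.out.println("[" + toLine(buffer, 7) + "]");
    }
}
```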
Again, the ReadConsole function can only be used if GetConsoleMode returns true.
Encoding to and decoding from the console code page
Encoding/decoding between wide chars and bytes could be done in Java, but that would require mapping the code page identifiers to Java encodings. It is more convenient to pass the code page to the native function.
public static byte[] encode(int codePage, String message) {
|
public static String decode(int codePage,
|
The functions are invoked twice: the first time to calculate the output buffer size; the second to fill the buffer.
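For comparison, the pure-Java route mentioned above might look like the sketch below. It is not the approach this post takes: the table covers only a handful of code pages chosen for illustration, and the charset names are assumptions about what the running JRE provides.

```java
import java.nio.charset.Charset;
import java.util.Map;

final class CodePages {

    // Illustrative subset only; a real mapping would need many more entries.
    private static final Map<Integer, String> NAMES = Map.of(
            437, "IBM437",
            850, "IBM850",
            1251, "windows-1251",
            1252, "windows-1252",
            65001, "UTF-8");

    /** Resolves a Windows code page identifier to a Java Charset. */
    static Charset charsetFor(int codePage) {
        String name = NAMES.get(codePage);
        if (name == null) {
            throw new IllegalArgumentException("Unmapped code page: " + codePage);
        }
        // Resolved lazily so a missing charset only fails when requested.
        return Charset.forName(name);
    }

    static byte[] encode(int codePage, String message) {
        return message.getBytes(charsetFor(codePage));
    }

    static String decode(int codePage, byte[] bytes) {
        return new String(bytes, charsetFor(codePage));
    }

    public static void main(String[] args) {
        byte[] pound = encode(1252, "\u00A3");
        System.out.println(pound.length + " byte(s) in code page 1252");
    }
}
```

The drawback is visible in the table itself: every code page the application might meet has to be listed by hand, which is why delegating to the native conversion functions is more convenient.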
Printing characters as UTF-8
One other way to get the console to emit Unicode characters is to set its code page to UTF-8.
private static final int UTF_8 = 65001;
|
You can then encode and emit the bytes, treating the console handle like a file handle.
public static void writeToFile(Pointer hFile,
|
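When writing encoded bytes, the length passed to the write call is the byte count, which generally differs from the String length. The sketch below shows the difference for UTF-8 using only the standard library; the class and its hex helper are illustrative and not taken from the project.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {

    /** Encodes the message as UTF-8, the encoding behind code page 65001. */
    static byte[] toUtf8(String message) {
        return message.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String message = "\u00A3\u044F"; // two chars, each two bytes in UTF-8
        byte[] bytes = toUtf8(message);
        System.out.println(message.length() + " chars, " + bytes.length
                + " bytes: " + hex(bytes));
    }
}
```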
It doesn't appear to be possible to read Unicode characters in UTF-8 mode (i.e. by calling SetConsoleCP). This appears to be a limitation of cmd.exe (it doesn't work with the .Net Console.ReadLine method either).
Code
All the sources are available in a public Subversion repository.
Repository:
http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: Win32UnicodeConsole
Notes
Because the code will usually operate differently when run from an IDE, you may want to look into remote debugging.
Found a funny class, sun.misc.SharedSecrets: it looks like we can get the real console code page via sun.misc.SharedSecrets.getJavaIOAccess().charset(). Also see java.io.Console, a new class that appeared in Java 1.6.
Interesting! I hadn't realized that java.io.Console did anything different to System.out.
String test = "\u00A3\u044F";
System.console().printf("%s%n", test);
System.out.println(test);
The above code, using a default cmd.exe on English Windows (CP850), will print U+00A3 (£) correctly via System.console() but not System.out.
Unfortunately, it doesn't help us with U+044F (я) [after switching to a TrueType font]. It lies outside both my native OEM and ANSI character sets (850/1252). java.io.Console doesn't seem to like the console being switched to UTF-8 (chcp 65001) either.
Still, this is good to know.
You are right. java.io.Console is not a new, cool, really-UTF-8-enabled console; it is just a more useful wrapper than System.out.
And this class can correctly determine the underlying code page. So if we set the console code page to 65001, we can output not only "я" but even "我" (of course, if we have a suitable Chinese font :) ).
But all this is only cool if we have Java 1.6 :)
I had a f*cking bug when we needed to output messages in different languages to this f*cking Windows cmd.exe (Linux bash too, but it had fewer problems; $LANG rocks), so I was forced to implement similar native code to determine the f*cking OEM code page.
FWIW runtests.bat fails with JDK7.
e:\src\Win32UnicodeConsole>runtests.bat
INFO: showing code page
Input CP: 437
Output CP: 437
INFO: trying to set invalid CP
The parameter is incorrect.
INFO: setting code page to 1252
Input CP: 1252
Output CP: 1252
Changed to CP: 1252
Press any key to continue . . .
INFO: testing console mode
Console mode: 3
INFO: redirecting stdout and testing mode
Can't get console mode; reason: The handle is invalid.
JNA has moved to https://github.com/twall/jna ...