JSON documents are generally encoded as UTF-8, but the format also supports four other encoding forms. This post covers the mechanics of character encoding detection for JSON parsers that don't handle it themselves - for example, Gson and JSON.simple.
EDIT (2014): a version of this library has been published to Maven Central.
- JSON document character encoding
- Detecting JSON encoding in Java
- Sample API
- Reading and writing JSON documents
- Using UTF-16 for efficiency
JSON document character encoding
RFC 4627 gets straight to the point when discussing character encoding:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters, it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8
UTF-7 doesn't seem to get a look-in, but I've never seen this encoding used anyway. Byte order marks are not supported by this detection mechanism, so presumably they are not allowed. This means you should not encode JSON using StandardCharsets.UTF_16, because Java's "UTF-16" charset writes a byte order mark before the content.
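As a quick illustration (not from the original post), the encoded bytes below show the difference; the literal "[" stands in for the first character of a JSON document:

import java.nio.charset.StandardCharsets;

// Java's "UTF-16" charset prepends a byte order mark when encoding,
// which defeats the null-pattern detection described above
byte[] withBom = "[".getBytes(StandardCharsets.UTF_16);   // FE FF 00 5B
byte[] noBom   = "[".getBytes(StandardCharsets.UTF_16BE); // 00 5B - starts the expected 00 xx 00 xx pattern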
Detecting JSON encoding in Java
Here is a sample implementation that inspects the first four octets in a buffered stream:
public static Charset detectJsonEncoding(InputStream in)
    throws IOException, UnsupportedCharsetException {
  if (!in.markSupported()) {
    throw new IllegalArgumentException(
        "InputStream.markSupported returned false");
  }
  // Peek at the first four octets without consuming them
  in.mark(4);
  int mask = 0;
  for (int count = 0; count < 4; count++) {
    int r = in.read();
    if (r == -1) {
      break;
    }
    // Build a bit pattern: 0 for a null octet, 1 for anything else
    mask = mask << 1;
    mask |= (r == 0) ? 0 : 1;
  }
  in.reset();
  return match(mask);
}

private static Charset match(int mask) {
  switch (mask) {
    case 1:  // 00 00 00 xx
      return UTF_32BE();
    case 5:  // 00 xx 00 xx
      return UTF_16BE;
    case 8:  // xx 00 00 00
      return UTF_32LE();
    case 10: // xx 00 xx 00
      return UTF_16LE;
    default: // xx xx xx xx
      return UTF_8;
  }
}
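A usage sketch (the class name JsonEncoding and the variable path are assumptions for illustration) - wrap the source in a BufferedInputStream so that mark/reset is supported, then build a Reader with the detected charset:

try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
  // BufferedInputStream supports mark/reset, so the first four octets can be peeked at
  Charset charset = JsonEncoding.detectJsonEncoding(in);
  try (Reader reader = new InputStreamReader(in, charset)) {
    // hand the Reader to any Reader-based JSON parser
  }
}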
As the documentation for Charset notes, Java is only required to support six encodings: US-ASCII; ISO-8859-1; UTF-8; UTF-16BE; UTF-16LE; UTF-16. Only three of the five JSON encodings are guaranteed to be available. Although code is unlikely to encounter UTF-32 documents, it should guard against the possibility.
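The UTF_32BE() and UTF_32LE() methods called by the detection code are not shown in the listing above; a minimal sketch, assuming they simply resolve the optional charsets by name, might look like this:

// Sketch of the helpers assumed by the detection code; not part of the original listing.
// Charset.forName throws UnsupportedCharsetException on a JVM that does not provide
// UTF-32 support, which detectJsonEncoding already declares.
private static Charset UTF_32BE() {
  return Charset.forName("UTF-32BE");
}

private static Charset UTF_32LE() {
  return Charset.forName("UTF-32LE");
}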
Sample API
A ready-made library that implements the above code is available to download: json-utf.
The sources are available from a Subversion repository:
Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: json-utf
Reading and writing JSON documents
Here is some sample code demonstrating the use of the API to save and load data using Gson:
import java.io.*;
import java.lang.reflect.Type;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import blog.iae.json.utf.JsonUtf;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public final class Settings {
  private Settings() {
  }

  private static final Type TYPE = new TypeToken<Map<String, String>>() {
  }.getType();

  public static void save(Map<String, String> data, Path path) throws IOException {
    try (OutputStream out = Files.newOutputStream(path);
        Writer writer = JsonUtf.newJsonWriter(out)) {
      new Gson().toJson(data, TYPE, writer);
    }
  }

  public static Map<String, String> load(Path path) throws IOException {
    try (InputStream in = Files.newInputStream(path);
        Reader reader = JsonUtf.newJsonReader(in)) {
      return new Gson().fromJson(reader, TYPE);
    }
  }
}
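For completeness, a hypothetical caller might look like this (the file name and keys are only for illustration):

// Illustrative usage of the Settings class above
Map<String, String> prefs = new HashMap<>();
prefs.put("theme", "dark");
Settings.save(prefs, Paths.get("settings.json"));

Map<String, String> loaded = Settings.load(Paths.get("settings.json"));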
Using UTF-16 for efficiency
There are some rare conditions under which you might want to encode JSON data as UTF-16. Consider this JSON-encoded data:
["こんにちは世界"]
The data will be three bytes shorter when encoded as UTF-16 versus UTF-8.
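Each of the seven Japanese characters takes three bytes in UTF-8 but only two in UTF-16, which more than offsets the doubled cost of the four ASCII bracket and quote characters (25 bytes versus 22). A quick check, not from the original post:

String json = "[\"\u3053\u3093\u306b\u3061\u306f\u4e16\u754c\"]";
int utf8Length  = json.getBytes(StandardCharsets.UTF_8).length;    // 25 bytes
int utf16Length = json.getBytes(StandardCharsets.UTF_16BE).length; // 22 bytes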
The sample API includes a writer that will try to guess the most compact encoding by buffering the start of the output:
String[] data = { "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c" };

try (OutputStream out = new FileOutputStream(file);
    Writer writer = JsonUtf.newCompactJsonWriter(out)) {
  new Gson().toJson(data, String[].class, writer);
}
Note that other forms of compression (like GZIP) may wipe out any gains, and there is overhead in both buffering and inspecting the data. There may also be compatibility issues if data consumers incorrectly assume the use of UTF-8.