Java: "Content is not allowed in prolog" - causes of this XML processing error

Thursday, 16 September 2010

"Content is not allowed in prolog" is an error generally emitted by the Java XML parsers when they encounter unexpected data before the <?xml... declaration. You may inspect the document in a text editor and see nothing wrong; to understand the problem, you need to go down to the byte level. The most likely cause is a character encoding bug.

This code reproduces the problem:

    import java.io.*;
    import java.nio.charset.Charset;
    import javax.xml.parsers.*;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class ContentNotAllowedInProlog {
      // Parse the stream with the default SAX parser, discarding all events.
      private static void parse(InputStream stream) throws SAXException,
          ParserConfigurationException, IOException {
        SAXParserFactory.newInstance().newSAXParser().parse(stream,
            new DefaultHandler());
      }

      public static void main(String[] args) {
        String[] encodings = { "UTF-8", "UTF-16", "ISO-8859-1" };
        for (String actual : encodings) {
          for (String declared : encodings) {
            // Only test mismatches; compare by value, not by reference.
            if (!actual.equals(declared)) {
              // The declaration names one encoding...
              String xml = "<?xml version='1.0' encoding='" + declared
                  + "'?><x/>";
              // ...but the bytes are written in another.
              byte[] encoded = xml.getBytes(Charset.forName(actual));
              try {
                parse(new ByteArrayInputStream(encoded));
                // No exception: the mismatch went unnoticed.
                System.out.println("HIDDEN ERROR! actual:" + actual + " " + xml);
              } catch (Exception e) {
                System.out.println(e.getMessage() + " actual:" + actual + " xml:"
                    + xml);
              }
            }
          }
        }
      }
    }

The output:

Content is not allowed in prolog. actual:UTF-8 xml:<?xml version='1.0' encoding='UTF-16'?><x/>
HIDDEN ERROR! actual:UTF-8 <?xml version='1.0' encoding='ISO-8859-1'?><x/>
Content is not allowed in prolog. actual:UTF-16 xml:<?xml version='1.0' encoding='UTF-8'?><x/>
Content is not allowed in prolog. actual:UTF-16 xml:<?xml version='1.0' encoding='ISO-8859-1'?><x/>
HIDDEN ERROR! actual:ISO-8859-1 <?xml version='1.0' encoding='UTF-8'?><x/>
Content is not allowed in prolog. actual:ISO-8859-1 xml:<?xml version='1.0' encoding='UTF-16'?><x/>

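This code also highlights another, more insidious character encoding issue: data can be accidentally written in one encoding while declared as another, and everything seems to work. The two HIDDEN ERROR cases parse cleanly only because the test document contains nothing but ASCII characters, which UTF-8 and ISO-8859-1 encode as identical bytes.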
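The silent failure is easier to see with non-ASCII content. The following is a minimal sketch, not part of the original test: the class name `HiddenEncodingError` and the helper `parseText` are made up for illustration, and it assumes the JDK's default DOM parser. It encodes an e-acute as UTF-8 under an ISO-8859-1 declaration and prints what the parser reads back.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;

public class HiddenEncodingError {
  // Hypothetical helper: parse the bytes and return the root element's text.
  static String parseText(byte[] xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml))
          .getDocumentElement().getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // The element holds U+00E9 (e-acute). The declaration says ISO-8859-1,
    // but the bytes are written as UTF-8, where U+00E9 becomes C3 A9.
    String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><x>\u00e9</x>";
    String text = parseText(xml.getBytes(Charset.forName("UTF-8")));
    System.out.println("characters read: " + text.length());
    for (char c : text.toCharArray()) {
      System.out.printf("U+%04X%n", (int) c);
    }
  }
}
```

No exception is thrown, but the single character should come back as two (U+00C3 and U+00A9), because each UTF-8 byte is decoded as a separate ISO-8859-1 character.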

When you inspect the data in a hex editor, the problems become more apparent.

A valid UTF-16 form:

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00         __<_?_x_m_l_ _v_
65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 27 00         e_r_s_i_o_n_=_'_
31 00 2E 00 30 00 27 00 20 00 65 00 6E 00 63 00         1_._0_'_ _e_n_c_
6F 00 64 00 69 00 6E 00 67 00 3D 00 27 00 55 00         o_d_i_n_g_=_'_U_
54 00 46 00 2D 00 31 00 36 00 27 00 3F 00 3E 00         T_F_-_1_6_'_?_>_
3C 00 78 00 2F 00 3E 00                                 <_x_/_>_

Note: exact UTF-16 byte forms vary - big-endian, little-endian, with or without a byte-order-mark. This one is little-endian with a BOM.
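If no hex editor is to hand, a few lines of Java give a similar view. This is a rough sketch; `HexView` and `hex` are names invented here. It prints the bytes of the string `<?xml` under three charsets, none of which writes a byte-order-mark when encoding in Java.

```java
import java.nio.charset.Charset;

public class HexView {
  // Hypothetical helper: render bytes as space-separated upper-case hex pairs.
  static String hex(byte[] data) {
    StringBuilder sb = new StringBuilder();
    for (byte b : data) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      // Mask to 0..255 so negative bytes are not sign-extended.
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] encodings = { "UTF-8", "UTF-16LE", "UTF-16BE" };
    for (String encoding : encodings) {
      byte[] encoded = "<?xml".getBytes(Charset.forName(encoding));
      System.out.println(encoding + ": " + hex(encoded));
    }
  }
}
```

The UTF-16LE line should match the dump above from its third byte onward (3C 00 3F 00 ...), while UTF-16BE swaps each pair.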

An XML document that declares itself as UTF-16 but is really UTF-8:

EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E         ___<?xml version
3D 27 31 2E 30 27 20 65 6E 63 6F 64 69 6E 67 3D         ='1.0' encoding=
27 55 54 46 2D 31 36 27 3F 3E 3C 78 2F 3E               'UTF-16'?><x/>

Note: UTF-8 XML documents can come with or without a byte-order-mark. This one includes a BOM.
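The byte-order-marks in the two dumps can also be checked in code before the data reaches the parser. The sketch below is an illustration only: `BomSniffer` and `detectBom` are hypothetical names, and it recognises just the UTF-8 and UTF-16 marks. It compares the mark found in the leading bytes of the second dump with the encoding that document declares.

```java
public class BomSniffer {
  // Hypothetical helper: name the encoding implied by a leading BOM,
  // or return null if the data starts with none of the marks checked here.
  static String detectBom(byte[] d) {
    if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB
        && (d[2] & 0xFF) == 0xBF) {
      return "UTF-8";
    }
    if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
      return "UTF-16LE";
    }
    if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
      return "UTF-16BE";
    }
    return null;
  }

  public static void main(String[] args) {
    // Leading bytes of the mislabelled document: a UTF-8 BOM, then "<?".
    byte[] start = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x3C, 0x3F };
    String declared = "UTF-16"; // taken from the document's declaration
    String detected = detectBom(start);
    // "UTF-16LE" and "UTF-16BE" both count as matching a "UTF-16" declaration.
    boolean mismatch = detected != null && !detected.startsWith(declared);
    System.out.println("BOM says " + detected + ", declaration says "
        + declared + ", mismatch: " + mismatch);
  }
}
```

A null result only means no mark was found; it says nothing about which encoding the data actually uses.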

XML, Java and Encodings

The code was written and tested against Sun's win32 Java 1.6.0_17, which uses a version of the Apache Xerces parser internally. Other parsers and versions may report these errors with different messages.
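To see which parser implementation a given runtime supplies, something like the following should do. `WhichParser` and `parserClassName` are names made up for this sketch; the printed class name depends on the JVM and on any parser libraries on the classpath.

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
  // Hypothetical helper: class name of the SAX parser the factory hands out.
  static String parserClassName() {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getClass().getName();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("java.version: " + System.getProperty("java.version"));
    System.out.println("SAX parser:   " + parserClassName());
  }
}
```

On Sun and Oracle JDKs the name typically sits under a com.sun.org.apache.xerces.internal package, which is the bundled Xerces derivative.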

Illegal Argument Exception
