Thursday, 31 July 2008

Java: determining the version of an XML document

When working with XML in Java, it is not uncommon to want to work with the data using strongly typed objects that enforce the document structure. That is, you want to use Java Beans rather than a Document Object Model (DOM).

A number of technologies can be used to automate this transformation, such as the Apache Commons Digester (a rules-based entity mapper) or XMLBeans (which provides schema-based bean generation).

Document Structure and Versioning

Because you are using static typing, there is an implicit understanding that the document structure is going to be fixed. For most applications, this is fine, except when a new version of the document is introduced and the code is required to handle the old version as well as the new one. The new structure may be a superset of the old one or it may have structural changes that make it incompatible with the old one. Either way, the programmer needs to determine the version and handle it accordingly.

For real-world examples, we'll use the Java web application deployment descriptors. These can be found in J2EE WAR files (WEB-INF/web.xml) and define servlet mappings, access control and so on. Typical file signatures are listed below, starting at version 2.2 (the version that introduced web.xml). Notice that when version 2.4 was introduced, the platform switched from document type definitions (DTD) to schemas (XSD) for validation.

web.xml 2.2

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN"

web.xml 2.3

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"

web.xml 2.4

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID"

web.xml 2.5

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns:xsi=""
    id="WebApp_ID" version="2.5">


These signatures highlight a few problems:

  • The mix of XSD and DTD means that the same mechanism cannot be used to determine the version in all files.
  • The file has to be parsed twice: once to determine the version, and again when you pass the file off to whatever framework you're using to instantiate the beans.
  • Versioning is domain dependent; there is no universal solution for determining the version of an XML document.

Using the SAX Parser to Determine web.xml Version

Because a SAX parser has shipped with the standard Java library since version 1.4, it is probably the best bet for determining the version without adding dependencies. The code that follows prints the information that can be used to determine the file version.

(EDIT: actually, StAX would be a better option than SAX if you're on Java SE 6, Java EE 5 or above. It lets you avoid parsing the entire document.)
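As a rough sketch of the StAX alternative mentioned in the edit above, the snippet below pulls events only until the root element appears and then stops reading. The class name StaxRootPeek and the method peek are invented for this example. DTD support is switched off so that no external DTD is fetched, which also means this sketch only covers the schema-based descriptors (2.4 and later); a DTD-based file would simply yield a null version here.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper; the names are illustrative only. */
final class StaxRootPeek {

    private StaxRootPeek() {}

    /**
     * Reads events up to the root element and returns
     * { version attribute, namespace URI }; either entry may be null.
     */
    static String[] peek(InputStream stream) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: we only care about the root element, so DTD processing is disabled.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(stream);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // First start tag is the root element; nothing after it is pulled.
                    return new String[] {
                        reader.getAttributeValue(null, "version"),
                        reader.getNamespaceURI()
                    };
                }
            }
            return new String[] { null, null };
        } finally {
            reader.close(); // frees parser resources; does not close the stream
        }
    }
}
```

The SAX listing that this post was originally written around follows.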

  public static void printVersionInfo(InputStream stream)
      throws SAXException, ParserConfigurationException, IOException {
    // Anonymous handler class with 2 overridden methods
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String name,
          Attributes attributes) throws SAXException {
        if ("web-app".equals(name)) {
          String version = attributes.getValue("version");
          String xmlNamespace = attributes.getValue("xmlns");
          String schemaLocation = attributes.getValue("xsi:schemaLocation");
          if (version != null) {
            System.out.println("version=" + version);
          }
          if (xmlNamespace != null) {
            System.out.println("xmlns=" + xmlNamespace);
          }
          if (schemaLocation != null) {
            System.out.println("schemaLocation=" + schemaLocation);
          }
        }
        super.startElement(uri, localName, name, attributes);
      }

      @Override
      public InputSource resolveEntity(String publicId, String systemId)
          throws IOException, SAXException {
        System.out.println("publicId=" + publicId);
        return super.resolveEntity(publicId, systemId);
      }
    };

    //creation and invocation of the parser
    SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
    SAXParser parser = saxParserFactory.newSAXParser();
    parser.parse(stream, handler);
  }

Sample output for version 2.3 (DTD) and 2.4 (XSD) files looks like this:
    publicId=-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN


By matching these strings against the known public IDs and version values, we get the version. (EDIT: Note that the schemaLocation attribute is optional, so it cannot be relied on by itself.)
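To illustrate the string matching, here is a minimal sketch that turns a DTD public ID or a version attribute into a version number. The class WebXmlVersionMatcher and its match method are made up for this example; the two public IDs are the ones from the file signatures listed earlier, and anything else is treated as unknown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; the names are illustrative only. */
final class WebXmlVersionMatcher {

    // Public IDs copied from the DTD-based signatures (2.2 and 2.3).
    private static final Map<String, String> PUBLIC_IDS = new LinkedHashMap<>();
    static {
        PUBLIC_IDS.put("-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN", "2.2");
        PUBLIC_IDS.put("-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN", "2.3");
    }

    private WebXmlVersionMatcher() {}

    /**
     * Returns the descriptor version, or null if it cannot be determined.
     * Either argument may be null.
     */
    static String match(String publicId, String versionAttribute) {
        if (publicId != null) {
            String fromDtd = PUBLIC_IDS.get(publicId.trim());
            if (fromDtd != null) {
                return fromDtd; // DTD-based descriptor
            }
        }
        if (versionAttribute != null && !versionAttribute.trim().isEmpty()) {
            return versionAttribute.trim(); // XSD-based descriptor
        }
        return null; // unknown: let the caller decide how to fail
    }
}
```

Checking the public ID first keeps the DTD and XSD cases in one place, which is about as close to a single mechanism as this mix of formats allows.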

The cost of the version check can be minimized by:
  • Throwing a SAXException to stop the SAXParser after a match has been found.
  • Reusing the InputStream by using the InputStream.mark and InputStream.reset methods.
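Both ideas can be combined, roughly as sketched below. The class EarlyExitVersionCheck, its Handler and the check method are hypothetical names. The sketch assumes the stream supports mark/reset and that readLimit is large enough to cover whatever the parser buffers before it is stopped. It also assumes the parser may close its input, so the stream is wrapped in a filter that ignores close(). Returning an empty InputSource from resolveEntity is an addition of this sketch to keep the parser from fetching the external DTD.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper; the names are illustrative only. */
final class EarlyExitVersionCheck {

    /** Collects the identifying strings, then aborts the parse at the root element. */
    static final class Handler extends DefaultHandler {
        String publicId;   // set for DTD-based descriptors
        String version;    // set for XSD-based descriptors
        boolean finished;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            this.publicId = publicId;
            // An empty source stops the parser from fetching the external DTD.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
            version = attributes.getValue("version");
            finished = true;
            // Everything needed has been seen, so abandon the rest of the document.
            throw new SAXException("version check complete");
        }
    }

    private EarlyExitVersionCheck() {}

    static Handler check(InputStream stream, int readLimit)
            throws IOException, SAXException, ParserConfigurationException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(readLimit);
        // Shield the caller's stream in case the parser closes its input.
        InputStream shielded = new FilterInputStream(stream) {
            @Override
            public void close() {
                // intentionally left open for the caller
            }
        };
        Handler handler = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(shielded, handler);
        } catch (SAXException e) {
            if (!handler.finished) {
                throw e; // a genuine parse error, not our deliberate stop
            }
        } finally {
            stream.reset(); // rewind so the bean framework can read from the start
        }
        return handler;
    }
}
```

A stream that does not support mark/reset can be wrapped in a BufferedInputStream before calling check; after the call, the same stream can be handed to the bean-mapping framework from the beginning.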
