Unfortunately, there is no foolproof process for determining how to validate an arbitrary XML document. If you are receiving a document, you should not leave the choice of validation mechanism to a remote party (e.g. by downloading a DTD from the URI the document specifies). Doing so opens your application to, at the very least, a potential denial-of-service attack. A validation mechanism may not even be specified in the document: W3C XML Schema (XSD) does not require it, and RELAX NG does not seem to support such a mechanism. Then there are XML documents that just don't have a schema of any form.
Nevertheless, there are times when you need to inspect a document to find out what it is. Most commonly, support is required for multiple versions of a document, where the structure and validation mechanisms change over time.
Note: when talking about validation, this post is not referring to whether the XML is well formed or not. Any XML parser should be able to check the syntax. This is about external constraints imposed on the document structure via a schema, DTD, etc.
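To make the distinction concrete, here is a minimal sketch in which a document is well formed but still fails validation. The class name and the tiny inline schema are made up for illustration; in a real application the schema would be one you ship with your code, not one the document points to.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class WellFormedVersusValid {

    // Hypothetical schema, inlined to keep the sketch self-contained:
    // the only legal document is a single <note> element holding text.
    static final String NOTE_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/>"
        + "</xs:schema>";

    /** Syntax check only: any XML parser can do this. */
    static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // throws on the first syntax error
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    /** Structural check against a schema chosen by the application. */
    static boolean isValid(String xml) throws IOException, SAXException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // An empty protocol list forbids fetching external DTDs and schemas.
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        Schema schema = factory.newSchema(new StreamSource(new StringReader(NOTE_XSD)));
        Validator validator = schema.newValidator();
        validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed, or well formed but not schema-valid
        }
    }

    public static void main(String[] args) throws Exception {
        String memo = "<memo>hello</memo>";
        System.out.println("well formed: " + isWellFormed(memo));
        System.out.println("valid:       " + isValid(memo));
    }
}
```

The memo document passes the syntax check but is rejected by the validator, because the schema only declares a note element.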
Validation schema hints
Validation information in an XML file should be taken as a hint, not an instruction.
Example XML declaration that specifies a DOCTYPE:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
Example XML file that provides a schema location hint:
<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">
Note that this is an equally valid way to express the above document:
<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee">
There are a few things you can inspect to garner information on a document in order to determine how to validate it:
- The DOCTYPE. If a document specifies a DTD, you've found its validation mechanism.
- The Root Element. If the root element of a document is html, it is a good indication that it is an HTML file. This is generally not enough, though: there are numerous versions of HTML and you still need to pick the right DTD. (Let's pretend for a moment that all HTML files are well formed XML.)
- The Root Element's Namespace. If the XML file specifies a namespace, it is a good indicator of how to validate the document. Not all documents, particularly older ones, will specify a namespace.
- Schema Location Hints. The presence of the http://www.w3.org/2001/XMLSchema-instance namespace is a good indicator that an XML schema should be used. XML Schema provides two mechanisms to hint at the location of the XSD file: the schemaLocation and noNamespaceSchemaLocation attributes. An XML document that should be validated by schema may provide one, both, or neither of these attributes.
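One way to treat these signals as hints rather than instructions is to use them only as lookup keys into a table of validators the application already trusts. The sketch below is an illustration, not a complete design: the class name and the local resource paths are hypothetical, and a real table would probably also key on the root element name or a version attribute.

```java
import java.util.Map;
import java.util.Optional;

final class ValidatorChooser {

    // DTD public identifier -> schema bundled with the application.
    // The paths are hypothetical; nothing is fetched from the network.
    private static final Map<String, String> BY_PUBLIC_ID = Map.of(
        "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN",
        "/schemas/web-app_2_2.dtd");

    // Root element namespace -> schema bundled with the application.
    private static final Map<String, String> BY_NAMESPACE = Map.of(
        "http://java.sun.com/xml/ns/j2ee",
        "/schemas/web-app_2_4.xsd");

    /**
     * Returns the local schema resource for a document, or empty when the
     * document is not one the application recognises. The URLs found in
     * the DOCTYPE or in xsi:schemaLocation are deliberately never used.
     */
    static Optional<String> choose(String dtdPublicId, String rootNamespace) {
        // Map.of rejects null keys, so guard before each lookup.
        if (dtdPublicId != null && BY_PUBLIC_ID.containsKey(dtdPublicId)) {
            return Optional.of(BY_PUBLIC_ID.get(dtdPublicId));
        }
        if (rootNamespace != null && BY_NAMESPACE.containsKey(rootNamespace)) {
            return Optional.of(BY_NAMESPACE.get(rootNamespace));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(choose(null, "http://java.sun.com/xml/ns/j2ee"));
        System.out.println(choose(null, null));
    }
}
```

An empty result means the document is unknown; whether to reject it or accept it unvalidated is a decision for the application.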
Extracting this information using Java
The standard Java API provides a number of packages for handling XML. Since all we want from the XML is "header" information, there is no need to parse the entire document. This makes the SAX and DOM parsers poor choices. The XML stream reader in StAX is a better fit.
The reader is driven by XMLStreamReader.next(), which is used to iterate through parsing events. There is a DTD event, but an implementation of XMLResolver can be used as an alternative, both to report entities and to improve performance.
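A minimal sketch of this approach follows. It is not the original source for this post: the class and field names are made up, and it assumes the StAX implementation consults the XMLResolver for the external DTD named in the DOCTYPE, which may vary between implementations. Returning an empty stream from the resolver means the DTD is never downloaded.

```java
import java.io.ByteArrayInputStream;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlHeaderInspector {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Plain holder for the hints found before or on the root element. */
    static final class Header {
        String rootElement;
        String namespace;
        String dtdPublicId;
        String dtdSystemId;
        String schemaLocation;
        String noNamespaceSchemaLocation;
    }

    static Header inspect(Reader xml) throws XMLStreamException {
        Header header = new Header();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: the resolver is asked for the DOCTYPE's external DTD.
        // Record the first request and answer with an empty stream.
        factory.setXMLResolver((publicId, systemId, baseUri, ns) -> {
            if (header.dtdPublicId == null && header.dtdSystemId == null) {
                header.dtdPublicId = publicId;
                header.dtdSystemId = systemId;
            }
            return new ByteArrayInputStream(new byte[0]);
        });
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // The first start element is the root; stop reading here.
                    header.rootElement = reader.getLocalName();
                    header.namespace = reader.getNamespaceURI();
                    header.schemaLocation =
                        reader.getAttributeValue(XSI_NS, "schemaLocation");
                    header.noNamespaceSchemaLocation =
                        reader.getAttributeValue(XSI_NS, "noNamespaceSchemaLocation");
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return header;
    }

    public static void main(String[] args) throws XMLStreamException {
        Header h = inspect(new StringReader(
            "<web-app version=\"2.4\" xmlns=\"http://java.sun.com/xml/ns/j2ee\"/>"));
        System.out.println("Root element: " + h.rootElement);
        System.out.println("Namespace: " + h.namespace);
        System.out.println("Schema: " + h.schemaLocation);
    }
}
```

Because the loop breaks at the first start element, the cost of inspection does not grow with the size of the document body.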
Sample output for three documents (XSD-based; DTD-based; no validation):
File: sample_xsd.xml
Root element: web-app
Namespace: http://java.sun.com/xml/ns/j2ee
DTD name: null
DTD URI: null
Schema: http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd
No namespace schema: null

File: sample_dtd.xml
Root element: web-app
Namespace: null
DTD name: -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
DTD URI: http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
Schema: null
No namespace schema: null

File: build.xml
Root element: project
Namespace: null
DTD name: null
DTD URI: null
Schema: null
No namespace schema: null
The first two examples above are for different generations of deployment descriptors for Java web applications. The last is an Ant build file, which does not support a static definition of its structure.
All the sources are available in a public Subversion repository.