Java: safe character handling and URL building

This post discusses HTTP URLs in Java and how to avoid data loss due to encoding/escaping issues. Special mention is made of the query part, since it is frequently used to store data.

Background
URIs in Java code
Resources

Terminology:

URI. Uniform Resource Identifier. A string with a strictly defined structure.
URL. Uniform Resource Locator. A URI that points to a resource (like a web page or e-mail address).
URN. Uniform Resource Name. A name that is a URI. Unlike a URL, there is no reason to expect your browser to return anything if you stick it into the address bar.

Background

Why are there problems?

As with many character encoding issues, the problems are due to a lack of standardisation in character handling when the technology was invented. Initial specifications did not detail how to handle non-ASCII characters in a consistent manner. Subsequent revisions had to deal with the reality in implementations. It was only with RFC 3986 that UTF-8 was mandated for encoding. When encoding is ambiguous, data corruption results.

It has just taken a while to iron out the kinks in URIs. To quote Roy Fielding (from 2008; slightly out of context):

...we spent 12 years reaching a global agreement on the meaning of URI, URL, URN, Web addresses, or whatever else you might call them, in order for all implementations to be interoperable and for all protocols to obey the same restrictions on generation of those identifiers. The result is IETF STD 66, RFC 3986, and it defines the most important standard of all the standards that make up what we call the Web.

Anatomy of a URI

The URI http://waffles:ponies@localhost:80/foo/bar?x=y&a=b#section1 can be broadly broken down into these component parts:

scheme        http
authority     waffles:ponies@localhost:80
path          /foo/bar
query         x=y&a=b
fragment      section1

Unsafe characters in the URL parts must be encoded using percent-encoding. Under this scheme, unsafe character data is encoded to a byte array using the given encoding, and each byte is then represented as a pair of hexadecimal digits prefixed by the percent sign. RFC 3986 mandates UTF-8 for this process, so Ā (code point U+0100) becomes %C4%80; legacy specifications might escape to different values, depending on the encoding used.

The string Hello, World!, encoded using the method described in RFC 3986 with the restrictions for a query part, becomes Hello,%20World!.
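The two examples above can be reproduced with a few lines of Java. This is a minimal sketch rather than a complete implementation: the class name PercentEncodeSketch and the method encodeQuery are made up for illustration, and the set of characters left unescaped is my reading of the RFC 3986 query rule (unreserved, sub-delims, and the characters : @ / ?).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this post.
class PercentEncodeSketch {

    // Characters assumed safe in a query part under RFC 3986:
    // unreserved + sub-delims + ':' '@' '/' '?'
    private static final String QUERY_SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        + "-._~" + "!$&'()*+,;=" + ":@/?";

    static String encodeQuery(String text) {
        StringBuilder out = new StringBuilder();
        // Step 1: characters to bytes, always UTF-8
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80 && QUERY_SAFE.indexOf(octet) >= 0) {
                out.append((char) octet);
            } else {
                // Step 2: unsafe bytes become %XX
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeQuery("Hello, World!")); // Hello,%20World!
        System.out.println(encodeQuery("\u0100"));        // %C4%80
    }
}
```

Note that this leaves ampersand and equals untouched, so it is only suitable for a query that has no key/value structure of its own.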

The encode and decode process must be symmetrical. There is no metadata in the URI that informs the receiver how it was encoded, so the URI must be encoded in a way that the server understands.
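To see what an asymmetric decode looks like, here is a small sketch that decodes the same escaped bytes twice, once with the encoding that produced them and once with a different one. The class and method names are hypothetical. URLDecoder is used only as a convenient percent-decoder here; the input contains no plus characters, so its form-decoding behaviour does not come into play.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demonstration class.
class EncodingMismatchSketch {

    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "%C4%80"; // U+0100 percent-encoded as UTF-8

        String right = decode(escaped, "UTF-8");
        String wrong = decode(escaped, "ISO-8859-1");

        System.out.println(right.equals("\u0100")); // true
        System.out.println(wrong.equals("\u0100")); // false
        System.out.println(wrong.length());         // 2: each byte became its own character
    }
}
```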

Note that the URI specification treats the query part as being unstructured. That is, it defines what characters can appear in it, but does not say that it should be composed of ampersand-delimited key/value pairs or have any other defined form. The structure of the query part, if any, is left to the scheme.

Note: the percent-encoding forms described by the HTML (application/x-www-form-urlencoded) and URI specs are similar enough to cause confusion, but contain differences that can cause data corruption if they are used incorrectly.

URIs and HTTP

The HTTP specification takes as simple a view as possible of URIs:

As far as HTTP is concerned, Uniform Resource Identifiers are simply formatted strings which identify--via name, location, or any other characteristic--a resource.

HTTP imposes no structure on the query part; its usage of this part of the URI is restricted to defining how caching of responses should be performed.

HTML and URIs

HTML defines a method for transmitting form parameters to application servers. This is the request data of a sample POST operation:

POST /path HTTP/1.1
Host: localhost
User-Agent: Mozilla/4.0
Content-Length: 21
Content-Type: application/x-www-form-urlencoded

msg=Hello%2C+World%21

The request content is encoded as application/x-www-form-urlencoded data. Encoding the string Hello, World! using this method produces Hello%2C+World%21.
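The difference between the two encodings is easy to see side by side. The sketch below uses only java.net classes; the class name and the two helper methods are hypothetical, and the host and path passed to the URI constructor are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demonstration class.
class FormVsUriEncodingSketch {

    // application/x-www-form-urlencoded, as used for HTML form data
    static String formEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // URI-style escaping of a whole query part via the multi-argument constructor
    static String uriEncodeQuery(String text) {
        try {
            return new URI("http", "localhost", "/path", text, null).getRawQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("Hello, World!"));     // Hello%2C+World%21
        System.out.println(uriEncodeQuery("Hello, World!")); // Hello,%20World!
    }
}
```

The space is the character to watch: the form encoding writes it as a plus, while the URI encoding writes it as %20.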

Note: URLs with query parts encoded in this form are legal under the restrictions defined by the ABNF for RFC 3986. RFC 3986 normalisation can transform unreserved characters so that they are no longer percent-encoded as specified by application/x-www-form-urlencoded encoding, though this is unlikely to have any adverse effects; Java's URLEncoder does not encode these characters.

Note: it is likely that HTML 5 will introduce changes to how this process works, especially with regards to character encoding. See resources.

Update Jan 2010: note that the HTML 4.01 spec limits the characters you can use in the fragment part of a URI since it requires them to be valid element identifiers. User agents tend not to enforce these restrictions, but it is worth testing. These restrictions will also change in HTML 5. See resources.

URIs in Java code

The URL class

The URL class implementation makes it risky to use just as a string manipulation class. The URL class still has its uses - particularly with certain I/O operations - just handle with care. The recommended class for manipulating URLs is the URI class.
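One illustration of why care is needed (not necessarily the only concern): on the JDKs I am aware of, the URL constructor performs no escaping and little validation for http URLs, so a string that is not a legal URI is only rejected later, when it is converted. The class and method names below are hypothetical. The URL(String) constructor is deprecated in recent JDKs, so expect a compiler warning there.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demonstration class.
class UrlClassSketch {

    // Returns true only if the string survives both URL parsing and URI conversion.
    static boolean convertsToUri(String spec) {
        try {
            URL url = new URL(spec); // typically accepts an unescaped space without complaint
            url.toURI();             // strict URI parsing happens here
            return true;
        } catch (MalformedURLException | URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(convertsToUri("http://foo/a b"));   // false
        System.out.println(convertsToUri("http://foo/a%20b")); // true
    }
}
```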

The URI class

The URI class can encode (using UTF-8) and escape unsafe characters, but can be difficult to work with when you want to use common web conventions (such as using the ampersand and equals characters as query parameter delimiters).

    // we want to be able to use ANYTHING for keys/values
    String key = "weird=&key";
    String value = "strange%value";
    // these values as an escaped query
    String query = "weird%3D%26key=strange%25value";
    URI uri = URI.create("http://foo?" + query);
    System.out.println(uri.getQuery());
    System.out.println(uri.getRawQuery());

The above code will emit:

weird=&key=strange%value
weird%3D%26key=strange%25value

In the decoded form, the ampersand and equals characters in the key will prevent you from parsing the query correctly. You need to use the raw value, then parse and decode it yourself.
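A minimal sketch of the "split first, decode second" approach follows. The class RawQueryParserSketch and its methods are hypothetical, it assumes the common ampersand/equals convention, and it keeps only the last value for a repeated key. URLDecoder is borrowed as the percent-decoder; replacing each literal plus with %2B beforehand is a workaround to stop it from turning pluses into spaces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library mentioned in this post.
class RawQueryParserSketch {

    // Percent-decode as UTF-8 while keeping literal '+' characters intact.
    static String decode(String raw) {
        try {
            return URLDecoder.decode(raw.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // Split on the delimiters while everything is still escaped...
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // ...and only then decode each piece.
            params.put(decode(key), decode(value));
        }
        return params;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://foo?weird%3D%26key=strange%25value");
        Map<String, String> params = parse(uri.getRawQuery());
        System.out.println(params.get("weird=&key")); // strange%value
    }
}
```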

URLEncoder and URLDecoder

The javadoc for URL notes:

The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC2396.

URLEncoder and URLDecoder escape/unescape application/x-www-form-urlencoded data, but they won't parse parameters for you.

A Java URL builder

When it comes to manipulating URLs, sometimes you're better off using something specifically designed to do the job.

    // can build a URL
    UriBuilder uriBuilder = UriBuilder.create("http://foo/path?baz=bar#p1");
    // used to edit query parts
    QueryBuilder query = QueryBuilder.create().parse(uriBuilder.getQuery())
        .addParam("num", "#30").addParam("pcnt", "100%");
    URI uri = uriBuilder.setQuery(query).build();
    System.out.println(uri);

This code generates the URL http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1 which can be inspected with this code:

    URI uri = URI.create("http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1");
    // query the query
    Query query = QueryBuilder.create().parse(uri.getRawQuery()).build();
    System.out.println(query.findValue("num"));

This emits the value #30.

I wrote this library as an exercise in API design - you can find the code in the resources. I'm not the first person to create URLs using the builder pattern, so I'm sure your favourite search engine will turn up a few alternatives.
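If you would rather not pull in a library, the query from the example above can be assembled with the standard library alone. This is a rough sketch under stated assumptions, not a replacement for a tested builder: StdlibQuerySketch, escape and buildQuery are hypothetical names, and the escaping borrows URLEncoder and then rewrites each plus as %20 so that the result does not depend on form-decoding rules. It escapes more characters than RFC 3986 strictly requires, which is harmless.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the hurl API.
class StdlibQuerySketch {

    // Escape a single key or value; spaces come out as %20 rather than '+'.
    static String escape(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Join escaped pairs using the conventional '&' and '=' delimiters.
    static String buildQuery(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(escape(entry.getKey())).append('=').append(escape(entry.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("baz", "bar");
        params.put("num", "#30");
        params.put("pcnt", "100%");
        URI uri = URI.create("http://foo/path?" + buildQuery(params) + "#p1");
        System.out.println(uri); // http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1
    }
}
```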

Testing URLs in web applications

In a Servlet container, you can read URL parameters using the ServletRequest parameter methods. These are exposed via the param implicit object in JSPs. If you test with a few Servlet containers, you'll see plus characters being turned into spaces on decode: the container treats the query as application/x-www-form-urlencoded data.
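The plus-to-space behaviour can be reproduced outside a container. In this sketch URLDecoder stands in for the container's form-style decoding, which is an assumption about how a given container behaves rather than a description of any particular one; the class name, method name and URL are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demonstration class.
class PlusDecodingSketch {

    // Form-style decoding, standing in for what a Servlet container does.
    static String formDecode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://localhost/index.jsp?urldata=1+1");

        // URI-style decoding leaves the plus alone
        System.out.println(uri.getQuery());      // urldata=1+1

        // Form-style decoding reads the plus as a space
        System.out.println(formDecode("1+1"));   // 1 1

        // Escaping the plus when building the URL survives both
        System.out.println(formDecode("1%2B1")); // 1+1
    }
}
```

So when a value may contain a plus, escape it as %2B when you build the URL.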

You may need to configure your application server to decode URLs using the encoding you encoded them with. For example, Apache Tomcat 6 treats URLs as being ISO-8859-1 encoded by default (wiki documentation). Exactly how this configuration is performed is an implementation detail.

The test application is made up of a JSP and a Servlet. The JSP posts to the Servlet, which encodes the POST data in the URL and redirects back to the JSP to display it. This should show up any Unicode bugs in your configuration/URL code.

index.jsp:

<?xml version="1.0" encoding="UTF-8" ?><jsp:root 
    xmlns:jsp="http://java.sun.com/JSP/Page" version="2.0"
    xmlns:c="http://java.sun.com/jsp/jstl/core"><jsp:directive.page
    language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"
    /><jsp:text><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> ]]>
    </jsp:text>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
    <body>
    Data from URL: <c:out value="${param.urldata}" />
    <br />
    <form action="UrlMaker" method="post" accept-charset="UTF-8"><input
        id="dataToEncode" name="dataToEncode" /> <input value="Update URL"
        type="submit" /></form>
    </body>
    </html>
</jsp:root>

A Servlet mapped to /UrlMaker:

    public class UrlMaker extends HttpServlet {
      private static final long serialVersionUID = 1L;

      protected void doPost(HttpServletRequest request, HttpServletResponse response)
          throws ServletException, IOException {
        String dataToEncode = request.getParameter("dataToEncode");
        PathBuilder path = PathBuilder.create().parse(request.getContextPath())
            .addElement("index.jsp");
        UriBuilder uriBuilder = UriBuilder.create().setPath(path);
        if (dataToEncode != null) {
          QueryBuilder query = QueryBuilder.create().addParam("urldata",
              dataToEncode);
          uriBuilder.setQuery(query);
        }
        response.sendRedirect(uriBuilder.toString());
      }
    }

Resources

Sources for the example URL API code are available in a public Subversion repository.

Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: hurl
Binary: hurl1.1.zip

Internet Official Protocol Standards (RFC 5000)
- Uniform Resource Identifier (URI): Generic Syntax (RFC 3986)
- Hypertext Transfer Protocol -- HTTP/1.1
HTML 4.01:
- 17.13.4 Form content types
- 6.2 SGML basic types: ID and NAME tokens
HTML 5 is still in draft. (At time of writing: W3C Working Draft 25 August 2009.) Here are some links to the most recent versions of some relevant sections:
The Java EE 5 Tutorial Part II: The Web Tier
Java: a rough guide to character encoding
Microsoft Multilingual Text Generator - STRGEN (Windows users can use this to generate test strings)

4 comments:

Anonymous, 11 December 2011 at 03:57
I'm trying to find this via a maven repository, is it published to any of the widely used ones?
McDowell, 11 December 2011 at 10:58
@Anonymous - I have not deployed this code to any repositories and do not intend to.

Note that this is not production-tested code.
Anonymous, 11 December 2011 at 13:11
Thanks for the quick response.

Do you happen to know of any similar alternative that is deployed to a repository?

I'm surprised I'm not able to find anything to solve this problem in repositories, it makes me think I'm going about my search in the wrong way since parsing a url is such a common thing.

What I'm looking for from an API perspective is similar to what hurl does, and I've not been able to find that elsewhere.

Thanks again.
McDowell, 27 December 2011 at 11:40
The answers to this Stack Overflow question offer a couple of suggestions.


Illegal Argument Exception

Thursday, 17 December 2009