This post discusses HTTP URLs in Java and how to avoid data loss due to encoding/escaping issues. Special mention is made of the query part, since it is frequently used to store data.
Terminology:
- URI. Uniform Resource Identifier. A string with a strictly defined structure.
- URL. Uniform Resource Locator. A URI that points to a resource (like a web page or e-mail address).
- URN. Uniform Resource Name. A name that is a URI. Unlike a URL, there is no reason to expect your browser to return anything if you stick it into the address bar.
Background
Why are there problems?
As with many character encoding issues, the problems are due to a lack of standardisation in character handling when the technology was invented. Initial specifications did not detail how to handle non-ASCII characters in a consistent manner. Subsequent revisions had to deal with the reality in implementations. It was only with RFC 3986 that UTF-8 was mandated for encoding. When encoding is ambiguous, data corruption results.
It has just taken a while to iron out the kinks in URIs. To quote Roy Fielding (from 2008; slightly out of context):
...we spent 12 years reaching a global agreement on the meaning of URI, URL, URN, Web addresses, or whatever else you might call them, in order for all implementations to be interoperable and for all protocols to obey the same restrictions on generation of those identifiers. The result is IETF STD 66, RFC 3986, and it defines the most important standard of all the standards that make up what we call the Web.
Anatomy of a URI
The URI http://waffles:ponies@localhost:80/foo/bar?x=y&a=b#section1 can be broadly broken down into these component parts:
- scheme: http
- authority: waffles:ponies@localhost:80
- path: /foo/bar
- query: x=y&a=b
- fragment: section1
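As a quick check of the breakdown above, the getters on java.net.URI return the same parts. This is a minimal sketch; the class name UriAnatomy and the helper parts are made up for the illustration.

```java
import java.net.URI;

class UriAnatomy {
    // Returns scheme, authority, path, query and fragment, in that order.
    static String[] parts(String text) {
        URI uri = URI.create(text);
        return new String[] {
            uri.getScheme(), uri.getAuthority(), uri.getPath(),
            uri.getQuery(), uri.getFragment()
        };
    }

    public static void main(String[] args) {
        String[] names = {"scheme", "authority", "path", "query", "fragment"};
        String[] values = parts("http://waffles:ponies@localhost:80/foo/bar?x=y&a=b#section1");
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + values[i]);
        }
    }
}
```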
Unsafe characters in the URL parts must be encoded using percent-encoding. Under this scheme, unsafe character data is encoded to a byte array using the given encoding, and each byte is then represented as a pair of hexadecimal digits prefixed by the percent sign. RFC 3986 mandates UTF-8 for this process, so Ā (code point U+0100) becomes %C4%80; legacy specifications might escape to different values, depending on the encoding used.
The string Hello, World!, encoded using the method described in RFC 3986 with the restrictions for a query part, becomes Hello,%20World!.
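The two steps (characters to bytes, bytes to hex pairs) can be sketched as follows. The helper escapeAllBytes is hypothetical and deliberately escapes every byte so the values are visible; real encoders leave safe characters alone. The é example is an addition to show how a legacy encoding produces a different escape. The rawQuery helper relies on the multi-argument java.net.URI constructors, which quote characters that are illegal in the component they are given.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncoding {
    // Encodes the text to bytes, then writes each byte as %XX.
    static String escapeAllBytes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    // Lets java.net.URI escape a query part for us.
    static String rawQuery(String query) throws URISyntaxException {
        return new URI("http", "localhost", "/", query, null).getRawQuery();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(escapeAllBytes("\u0100", StandardCharsets.UTF_8));      // %C4%80
        System.out.println(escapeAllBytes("\u00E9", StandardCharsets.UTF_8));      // %C3%A9
        System.out.println(escapeAllBytes("\u00E9", StandardCharsets.ISO_8859_1)); // %E9
        System.out.println(rawQuery("Hello, World!"));                             // Hello,%20World!
    }
}
```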
The encode and decode process must be symmetrical. There is no metadata in the URI that informs the receiver how it was encoded, so the URI must be encoded in a way that the server understands.
Note that the URI specification treats the query part as being unstructured. That is, it defines what characters can appear in it, but does not say that it should be composed of ampersand-delimited key/value pairs or have any other defined form. The structure of the query part, if any, is left to the scheme.
Note: the percentage-encoding forms described by the HTML (application/x-www-form-urlencoded) and URI specs are similar enough to cause confusion, but contain differences that can cause data corruption if they are used incorrectly.
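The plus sign is a simple way to see the two forms disagree. In this sketch (class and method names are illustrative, and it assumes Java 10 or later for the Charset overload of URLDecoder.decode), a value is escaped as a URI query part and then read back by a form decoder:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class PlusSignMismatch {
    // Escapes the text as a URI query part using java.net.URI.
    static String asUriQuery(String text) throws URISyntaxException {
        return new URI("http", "localhost", "/", text, null).getRawQuery();
    }

    // Decodes the text as application/x-www-form-urlencoded data.
    static String formDecode(String text) {
        return URLDecoder.decode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws URISyntaxException {
        String raw = asUriQuery("1+1=2");  // plus is legal in a query, so it is left alone
        String decoded = formDecode(raw);  // a form decoder reads plus as a space
        System.out.println(raw);     // 1+1=2
        System.out.println(decoded); // 1 1=2
    }
}
```

The encoder and decoder each behaved correctly by their own rules; the data was lost because they were not a matching pair.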
URIs and HTTP
The HTTP specification takes as simple a view as possible of URIs:
As far as HTTP is concerned, Uniform Resource Identifiers are simply formatted strings which identify--via name, location, or any other characteristic--a resource.
HTTP imposes no structure on the query part; its usage of this part of the URI is restricted to defining how caching of responses should be performed.
HTML and URIs
HTML defines a method for transmitting form parameters to application servers. This is the request data of a sample POST operation:
    POST /path HTTP/1.1
    Host: localhost
    User-Agent: Mozilla/4.0
    Content-Length: 21
    Content-Type: application/x-www-form-urlencoded

    msg=Hello%2C+World%21
The request content is encoded as application/x-www-form-urlencoded data. Encoding the string Hello, World! using this method produces Hello%2C+World%21.
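The request body above can be reproduced with URLEncoder. This is a minimal sketch; the FormBody class is invented for the example, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormBody {
    // Builds a single-parameter application/x-www-form-urlencoded body.
    static String body(String name, String value) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String body = body("msg", "Hello, World!");
        System.out.println(body); // msg=Hello%2C+World%21
        // Content-Length counts bytes; the escaped body is pure ASCII
        System.out.println(body.getBytes(StandardCharsets.US_ASCII).length); // 21
    }
}
```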
Note: URLs with query parts encoded in this form are legal under the restrictions defined by the ABNF for RFC 3986. RFC 3986 normalisation can transform unreserved characters so that they are no longer percentage-escaped as specified by application/x-www-form-urlencoded encoding, though this is unlikely to have any adverse effects; Java's URLEncoder leaves these characters unescaped anyway, with the exception of the tilde, which it escapes as %7E.
Note: it is likely that HTML 5 will introduce changes to how this process works, especially with regard to character encoding. See resources.
Update Jan 2010: note that the HTML 4.01 spec limits the characters you can use in the fragment part of a URI since it requires them to be valid element identifiers. User agents tend not to enforce these restrictions, but it is worth testing. These restrictions will also change in HTML 5. See resources.
URIs in Java code
The URL class
The URL class implementation makes it risky to use just as a string manipulation class. The URL class still has its uses - particularly with certain I/O operations - just handle with care. The recommended class for manipulating URLs is the URI class.
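One way to follow this advice is to do all the string work with URI and convert to URL only at the point of I/O. The sketch below uses hypothetical names (UriThenUrl, resolve). As background, the URL javadoc notes that equals and hashCode may resolve host names, which is one reason to keep URL objects out of general string handling.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UriThenUrl {
    // Manipulates the address as a URI, then converts to URL for I/O.
    static URL resolve(String base, String relative) throws MalformedURLException {
        URI resolved = URI.create(base).resolve(relative);
        return resolved.toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = resolve("http://localhost/foo/bar", "baz?x=y");
        System.out.println(url); // http://localhost/foo/baz?x=y
        // url.openConnection() or url.openStream() would perform the I/O
    }
}
```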
The URI class
The URI class can encode (using UTF-8) and escape unsafe characters, but can be difficult to work with when you want to use common web conventions (such as using the ampersand and equals characters as query parameter delimiters).
// we want to be able to use ANYTHING for keys/values
The above code will emit:
weird=&key=strange%value
weird%3D%26key=strange%25value
In the decoded form, the ampersand and equals characters in the key will prevent you from parsing the query correctly. You need to use the raw value, then parse and decode it yourself.
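The following is a sketch of the behaviour just described, not necessarily the original listing. It assumes the key and value are escaped separately with URLEncoder before the URI is parsed, and that Java 10 or later is available for the Charset overloads. The class RawQueryDemo and its methods are illustrative names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class RawQueryDemo {
    // Escapes key and value separately, then joins them with a literal "=".
    static URI build(String key, String value) throws URISyntaxException {
        String query = URLEncoder.encode(key, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
        return new URI("http://localhost/foo?" + query);
    }

    // Splits the RAW query on the first "=" and only then decodes each side.
    static String[] parsePair(URI uri) {
        String raw = uri.getRawQuery();
        int eq = raw.indexOf('=');
        return new String[] {
            URLDecoder.decode(raw.substring(0, eq), StandardCharsets.UTF_8),
            URLDecoder.decode(raw.substring(eq + 1), StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = build("weird=&key", "strange%value");
        System.out.println(uri.getQuery());    // weird=&key=strange%value
        System.out.println(uri.getRawQuery()); // weird%3D%26key=strange%25value
        String[] pair = parsePair(uri);
        System.out.println(pair[0]);           // weird=&key
        System.out.println(pair[1]);           // strange%value
    }
}
```

Splitting before decoding is what keeps the escaped delimiters inside the key from being mistaken for real ones.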
URLEncoder and URLDecoder
The javadoc for URL notes:
The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC2396.
URLEncoder and URLDecoder escape/unescape application/x-www-form-urlencoded data, but they won't parse parameters for you.
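Since the parsing is left to you, a small helper is usually needed around these classes. The sketch below is one hypothetical shape for it (FormParams is not a JDK class). It assumes Java 10 or later and, as a simplification, keeps only the last value when a key is repeated.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormParams {
    // Joins escaped key=value pairs with "&".
    static String encode(Map<String, String> params) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (out.length() > 0) {
                out.append('&');
            }
            out.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Splits on the delimiters first, then unescapes each key and value.
    static Map<String, String> decode(String data) {
        Map<String, String> params = new LinkedHashMap<>();
        if (data.isEmpty()) {
            return params;
        }
        for (String pair : data.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("a&b", "c d");
        params.put("\u00E9", "x");
        String encoded = encode(params);
        System.out.println(encoded);                        // a%26b=c+d&%C3%A9=x
        System.out.println(decode(encoded).equals(params)); // true
    }
}
```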
A Java URL builder
When it comes to manipulating URLs, sometimes you're better off using something specifically designed to do the job.
// can build a URL
This code generates the URL http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1 which can be inspected with this code:
URI uri = URI.create("http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1");
This emits the value #30.
I wrote this library as an exercise in API design - you can find the code in the resources. I'm not the first person to create URLs using the builder pattern, so I'm sure your favourite search engine will turn up a few alternatives.
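To give a feel for the builder approach without reproducing the library, here is a hypothetical minimal builder. It is NOT the hurl API; every name in it is invented for this sketch. It escapes parameters with URLEncoder (so spaces would become plus signs), assumes the fragment needs no escaping, and assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the hurl API.
class TinyUrlBuilder {
    private final StringBuilder url;
    private boolean hasQuery;

    TinyUrlBuilder(String base) {
        url = new StringBuilder(base);
    }

    // Appends one escaped key=value pair to the query.
    TinyUrlBuilder param(String key, String value) {
        url.append(hasQuery ? '&' : '?');
        hasQuery = true;
        url.append(URLEncoder.encode(key, StandardCharsets.UTF_8))
           .append('=')
           .append(URLEncoder.encode(value, StandardCharsets.UTF_8));
        return this;
    }

    // Assumes the fragment contains only characters that are legal unescaped.
    URI build(String fragment) throws URISyntaxException {
        return new URI(url + "#" + fragment);
    }

    // Finds a parameter by splitting the raw query before decoding.
    static String lookup(URI uri, String key) {
        for (String pair : uri.getRawQuery().split("&")) {
            int eq = pair.indexOf('=');
            if (eq >= 0 && URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8).equals(key)) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new TinyUrlBuilder("http://foo/path")
                .param("baz", "bar")
                .param("num", "#30")
                .param("pcnt", "100%")
                .build("p1");
        System.out.println(uri);                // http://foo/path?baz=bar&num=%2330&pcnt=100%25#p1
        System.out.println(lookup(uri, "num")); // #30
    }
}
```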
Testing URLs in web applications
In a Servlet container, you can read URL parameters using the ServletRequest parameter methods. These are exposed via the param implicit object in JSPs. If you test with a few Servlet containers, you'll see any plus characters being turned into spaces on decode - the containers treat the query as application/x-www-form-urlencoded data.
You may need to configure your application server to decode URLs using the encoding you encoded them with. For example, Apache Tomcat 6 treats URLs as being ISO-8859-1 encoded by default (wiki documentation). Exactly how this configuration is performed is an implementation detail.
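The effect of a mismatched decoder can be reproduced without a server. This sketch (illustrative class name, Java 10 or later assumed) decodes the UTF-8 escape for U+0100 once with the right charset and once with ISO-8859-1. For Tomcat specifically, the usual place to change this is the URIEncoding attribute on the Connector element in server.xml, though you should check the documentation for your version.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDecode {
    // Unescapes percent-encoded data using the given charset.
    static String decode(String escaped, Charset charset) {
        return URLDecoder.decode(escaped, charset);
    }

    public static void main(String[] args) {
        String escaped = "%C4%80"; // U+0100 percent-encoded as UTF-8
        String right = decode(escaped, StandardCharsets.UTF_8);
        String wrong = decode(escaped, StandardCharsets.ISO_8859_1);
        // Print code points rather than characters to avoid console encoding issues
        right.chars().forEach(c -> System.out.printf("U+%04X ", c)); // U+0100
        System.out.println();
        wrong.chars().forEach(c -> System.out.printf("U+%04X ", c)); // U+00C4 U+0080
        System.out.println();
    }
}
```

One character went in and two different characters came out, which is the kind of corruption the test application below is meant to expose.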
The test application is made up of a JSP and a Servlet. The JSP posts to the Servlet, which encodes the POST data in the URL and redirects back to the JSP to display it. This should show up any Unicode bugs in your configuration/URL code.
index.jsp:
<?xml version="1.0" encoding="UTF-8" ?>
<jsp:root xmlns:jsp="http://java.sun.com/JSP/Page" version="2.0"
  xmlns:c="http://java.sun.com/jsp/jstl/core">
  <jsp:directive.page language="java"
    contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" />
  <jsp:text><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
]]>
  </jsp:text>
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  </head>
  <body>
  Data from URL: <c:out value="${param.urldata}" />
  <br />
  <form action="UrlMaker" method="post" accept-charset="UTF-8">
    <input id="dataToEncode" name="dataToEncode" />
    <input value="Update URL" type="submit" />
  </form>
  </body>
  </html>
</jsp:root>
A Servlet mapped to /UrlMaker:
public class UrlMaker extends HttpServlet {
Resources
Sources for the example URL API code are available in a public Subversion repository.
Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: hurl
Binary: hurl1.1.zip
- Internet Official Protocol Standards (RFC 5000)
- HTML 4.01:
- HTML 5 is still in draft. (At time of writing: W3C Working Draft 25 August 2009.) Here are some links to the most recent versions of some relevant sections:
- The Java EE 5 Tutorial Part II: The Web Tier
- Java: a rough guide to character encoding
- Microsoft Multilingual Text Generator - STRGEN (Windows users can use this to generate test strings)
I'm trying to find this via a maven repository, is it published to any of the widely used ones?
@Anonymous - I have not deployed this code to any repositories and do not intend to.
Note that this is not production-tested code.
Thanks for the quick response.
Do you happen to know of any similar alternative that is deployed to a repository?
I'm surprised I'm not able to find anything to solve this problem in repositories, it makes me think I'm going about my search in the wrong way since parsing a url is such a common thing.
What I'm looking for from an API perspective is similar to what hurl does, and I've not been able to find that elsewhere.
Thanks again.
The answers to this Stack Overflow question offer a couple of suggestions.