Thursday, 31 January 2013

Java: validating regular expression syntax during compilation

Java doesn't support regular expression literals as some other languages do. Still, you can get halfway there.

Regular expression literals

Here's a trivial regular expression example in Java that replaces backslashes with forward slashes:

public class Slashes {
  public static void main(String[] args) {
    String windowsPath = "foo\\bar\\baz";
    String expression = "\\\\";
    String unixPath = windowsPath.replaceAll(expression, "/");
    System.out.append(windowsPath).append(" -> ").println(unixPath);
  }
}

In order to replace \ we need to match using \\\\ because backslash is the escape character in both Java strings and regular expressions.
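To make the two layers of escaping concrete, here is a small sketch. The class and method names are my own, not part of the sample project. It shows that the four-backslash source literal is a two-character string, which the regex engine then reads as one literal backslash. It also shows two standard ways to avoid escaping by hand: Pattern.quote and String.replace.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class EscapeLayers {
  /** Replaces every backslash using a hand-escaped regular expression. */
  static String viaRegex(String path) {
    // Source literal "\\\\" -> two-character string -> regex matching one backslash.
    return path.replaceAll("\\\\", "/");
  }

  /** Same result, letting Pattern.quote do the regex-level escaping. */
  static String viaQuote(String path) {
    return path.replaceAll(Pattern.quote("\\"), "/");
  }

  /** Same result without any regular expression. */
  static String viaLiteral(String path) {
    return path.replace("\\", "/");
  }

  public static void main(String[] args) {
    String windowsPath = "foo\\bar\\baz";
    System.out.println("\\\\".length());         // 2
    System.out.println(viaRegex(windowsPath));   // foo/bar/baz
    System.out.println(viaQuote(windowsPath));   // foo/bar/baz
    System.out.println(viaLiteral(windowsPath)); // foo/bar/baz
  }
}
```

For a fixed single-character replacement like this one, String.replace is arguably the simplest choice. The regex form is kept here because it is the subject of the post.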

Where a language supports regex literals the expressions become less verbose and syntax errors can be detected by the parser. Here is the same operation in JavaScript:

var windowsPath = "foo\\bar\\baz";
var expression = /\\/g; // g flag: replace every match, like Java's replaceAll
var unixPath = windowsPath.replace(expression, "/");

Because the literal is part of the language grammar, the parser and static inspection tools can report syntax errors in it. Take this malformed expression:

/\/
SyntaxError: Invalid regular expression: missing /

In the Java version the parser is blind to any syntax errors in the regular expression grammar because the compiler treats it as just a string. This problem can be overcome.
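The sketch below illustrates what "blind" means in practice. The class name and helper are hypothetical. The malformed expression compiles as an ordinary string, and the mistake only surfaces as a PatternSyntaxException when the line that uses it actually runs.

```java
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the sample project.
class RuntimeRegexFailure {
  /** Returns true if using the expression in replaceAll fails at run time. */
  static boolean failsAtRuntime(String expression) {
    try {
      // javac accepts any string here; the regex is only parsed when this executes.
      "foo".replaceAll(expression, "");
      return false;
    } catch (PatternSyntaxException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(failsAtRuntime("[A-Z]+")); // false
    System.out.println(failsAtRuntime("++"));     // true
  }
}
```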

Validating string literals with annotations

From Java 6 onwards, annotation processing is supported by javac as part of the compile process.

Here's an example class showing the use of a custom RegexSyntax annotation:

import blog.iae.regex.annotation.RegexSyntax;

class Foo {
  /** OK */
  @RegexSyntax
  final String matchLatinCapitals = "[A-Z]+";

  /** Not legal regular expression */
  @RegexSyntax
  final String fail = "++";

  /** But it is if escaped */
  @RegexSyntax
  final String escaped = "\\+\\+";

  /** Or as a literal */
  @RegexSyntax(flags = java.util.regex.Pattern.LITERAL)
  final String literal = "++";

  String winToUnix(String path) {
    @RegexSyntax
    final String winSlash = "\\\\";
    return path.replaceAll(winSlash, "/");
  }
}

All are legal Java syntax, but one is not a valid regular expression. Compilation fails:

>javac -cp regex-annotation-0.0.2.jar Foo.java
Foo.java:10: error: blog.iae.regex.annotation.RegexSyntax: Dangling meta character '+' near index 0
  final String fail = "++";
               ^
  ++
  ^
1 error

The literals are validated at compile time by a Processor implementation:

  // snippet from a Processor implementation
  
  /**
   * Processes elements annotated with {@link RegexSyntax}.
   */
  @Override
  public boolean process(Set<? extends TypeElement> annotations,
      RoundEnvironment roundEnv) {
    for (Element target : roundEnv.getElementsAnnotatedWith(RegexSyntax.class)) {
      if (isVariable(target) && isFinal(target) && isStringConstant(target))
        validateExpression(target);
    }
    return true;
  }

  /**
   * Emits an error if the string is not legal regular expression syntax.
   * 
   * @see ProcessingEnvironment
   * @see Pattern
   */
  private void validateExpression(Element element) {
    VariableElement variable = (VariableElement) element;
    String pattern = variable.getConstantValue().toString();
    try {
      int flags = element.getAnnotation(RegexSyntax.class).flags();
      Pattern.compile(pattern, flags);
    } catch (PatternSyntaxException e) {
      String err = RegexSyntax.class.getName() + ": " + e.getLocalizedMessage();
      env.getMessager().printMessage(Kind.ERROR, err, element);
    }
  }
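The core of validateExpression can be tried outside the compiler. The following is a minimal sketch under that assumption: the RegexCheck class and its check method are hypothetical names. It applies the same Pattern.compile(pattern, flags) call and shows how the flags value changes the verdict for the "++" example.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical stand-alone version of the processor's check.
class RegexCheck {
  /** Returns an error message if the pattern does not compile with the given flags. */
  static Optional<String> check(String pattern, int flags) {
    try {
      Pattern.compile(pattern, flags);
      return Optional.empty();
    } catch (PatternSyntaxException e) {
      return Optional.of(e.getLocalizedMessage());
    }
  }

  public static void main(String[] args) {
    System.out.println(check("[A-Z]+", 0).isPresent());            // false
    System.out.println(check("++", 0).isPresent());                // true
    System.out.println(check("++", Pattern.LITERAL).isPresent());  // false
  }
}
```

In the processor the message is handed to the Messager together with the offending element, which is what lets javac point at the declaration.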

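Since you can't annotate a literal, it must be referenced via an annotated variable. The literal must be a compile-time constant assigned to a variable that is declared final, because the processor reads the value through VariableElement.getConstantValue(), which only returns a value for such constants.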
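The RegexSyntax source is not reproduced in this post, so the declaration below is only a guess at its shape, inferred from how it is used above: a flags element that can be omitted, applied to fields and local variables. The retention policy and the default value of 0 are assumptions.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Assumed declaration; the real one lives in the regex-annotation project.
@Retention(RetentionPolicy.SOURCE) // assumption: only needed while compiling
@Target({ElementType.FIELD, ElementType.LOCAL_VARIABLE})
@interface RegexSyntax {
  /** Flags passed to Pattern.compile; 0 (assumed default) means no flags. */
  int flags() default 0;
}
```

The annotation itself cannot require that the variable is final or constant. In the processor snippet those conditions are checked by isFinal and isStringConstant before validation.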

See "Getting Started with the Annotation Processing Tool, apt" for a short annotation processor development guide.

IDEs and other compilers

Not all tool chains will run annotation processors automatically, but most can be configured to.

[Screenshot: annotation errors in Eclipse]

Refer to your tool documentation for specifics.

Sample code

All the sources are available in a public Subversion repository.

Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: regex-annotation

You can download the prebuilt binary regex-annotation-0.0.2.zip from the Downloads page.
