Java: caseless strings

The simple task of trying to match strings without regard to upper or lower case is surprisingly hard when you try to take into account different language conventions. Things get more complicated when you start to consider accented letters and the legacy of typography.

This post discusses the Java 7 implementation which is documented to suport Unicode 6.0.0.

Note: graphemes will vary based on rendering engines and system fonts.

Case mapping

Here are some examples gleaned from the Unicode specification:

Mappings depend on alphabet. In Turkish U+0069 i maps to U+0130 İ and U+0131 ı maps to U+0049 I. This is known as the "Turkish four-Is" problem.
Mappings are not reversible. U+00DF ß maps to SS in German (because ß never starts a word) but SS maps back to ss. The newer capital form U+1E9E ẞ doesn't make things any simpler - see tailored casing in the specification.
Mappings are context-sensitive. U+03A3 Σ maps to U+03C3 σ or U+03C2 ς. It depends on where the letter is in the word.

Collation in Java

The java.text package contains types for sorting and equivalence testing. The closest thing Java has to a caseless string is the CollationKey type.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    CollationKey lowerCaseA = collator.getCollationKey("a");
    CollationKey upperCaseA = collator.getCollationKey("A");
    System.out.println(lowerCaseA.equals(upperCaseA)); // true

Sorting

The Collator type implements Comparator, so it is useful for sorting arrays or lists.

Sets and maps

Use the CollationKey as the type for Set types and the key type for Map implementations.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    CollationKey lower = collator.getCollationKey("a");
    CollationKey upper = collator.getCollationKey("A");

    Object reference = new Object();
    Map<CollationKey, Object> map = new HashMap<>();
    map.put(lower, reference);
    Object retrieved = map.get(upper);
    System.out.println(reference == retrieved); // true

Take care when attempting to use the Collator to create caseless collections.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    // Bug! Do not do this.
    Set<String> caseless = new TreeSet<>(collator);
    caseless.add("A");
    Set<String> cased = new HashSet<>(Arrays.asList("a"));

    System.out.println(caseless.equals(cased)); // true
    System.out.println(cased.equals(caseless)); // false

This code violates the general contract for equals (equality must be symmetric.)

Collators

The Collator is configured by three things: the locale; the minimum level of difference considered (strength;) and the decomposition mode for normalizing sequences.

The table below demonstrates the effect of a vector of configurations on the equivalence of a sample set of string pairs. The English and Turkish language locales are denoted by en and tr respectively.

Collator Configuration	`"a"` U+0061 `"A"` U+0041	`"ß"` U+00df `"ss"` U+0073 U+0073	`"i"` U+0069 `"İ"` U+0130	`"A"` U+0041 `"Ａ"` U+ff21	`"é"` U+00e9 `"É"` U+0045 U+0301	`"é"` U+00e9 `"é"` U+0065 U+0301
en PRIMARY NO_DECOMPOSITION	equal	equal	equal		equal	equal
en PRIMARY CANONICAL_DECOMPOSITION	equal	equal	equal		equal	equal
en PRIMARY FULL_DECOMPOSITION	equal	equal	equal	equal	equal	equal
en SECONDARY NO_DECOMPOSITION	equal	equal			equal	equal
en SECONDARY CANONICAL_DECOMPOSITION	equal	equal			equal	equal
en SECONDARY FULL_DECOMPOSITION	equal	equal		equal	equal	equal
en TERTIARY NO_DECOMPOSITION						equal
en TERTIARY CANONICAL_DECOMPOSITION						equal
en TERTIARY FULL_DECOMPOSITION				equal		equal
en IDENTICAL NO_DECOMPOSITION
en IDENTICAL CANONICAL_DECOMPOSITION						equal
en IDENTICAL FULL_DECOMPOSITION				equal		equal
tr PRIMARY NO_DECOMPOSITION	equal		equal		equal	equal
tr PRIMARY CANONICAL_DECOMPOSITION	equal		equal		equal	equal
tr PRIMARY FULL_DECOMPOSITION	equal		equal	equal	equal	equal
tr SECONDARY NO_DECOMPOSITION	equal		equal		equal	equal
tr SECONDARY CANONICAL_DECOMPOSITION	equal		equal		equal	equal
tr SECONDARY FULL_DECOMPOSITION	equal		equal	equal	equal	equal
tr TERTIARY NO_DECOMPOSITION						equal
tr TERTIARY CANONICAL_DECOMPOSITION						equal
tr TERTIARY FULL_DECOMPOSITION				equal		equal
tr IDENTICAL NO_DECOMPOSITION
tr IDENTICAL CANONICAL_DECOMPOSITION						equal
tr IDENTICAL FULL_DECOMPOSITION				equal		equal

Illegal Argument Exception

Saturday 16 February 2013