Saturday 16 February 2013

Java: caseless strings

The simple task of trying to match strings without regard to upper or lower case is surprisingly hard when you try to take into account different language conventions. Things get more complicated when you start to consider accented letters and the legacy of typography.

This post discusses the Java 7 implementation which is documented to suport Unicode 6.0.0.

Note: graphemes will vary based on rendering engines and system fonts.

Case mapping

Here are some examples gleaned from the Unicode specification:

  • Mappings depend on alphabet. In Turkish U+0069 i maps to U+0130 İ and U+0131 ı maps to U+0049 I. This is known as the "Turkish four-Is" problem.
  • Mappings are not reversible. U+00DF ß maps to SS in German (because ß never starts a word) but SS maps back to ss. The newer capital form U+1E9E doesn't make things any simpler - see tailored casing in the specification.
  • Mappings are context-sensitive. U+03A3 Σ maps to U+03C3 σ or U+03C2 ς. It depends on where the letter is in the word.

See also the Character Properties, Case Mappings & Names FAQ.

Collation in Java

The java.text package contains types for sorting and equivalence testing. The closest thing Java has to a caseless string is the CollationKey type.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    CollationKey lowerCaseA = collator.getCollationKey("a");
    CollationKey upperCaseA = collator.getCollationKey("A");
    System.out.println(lowerCaseA.equals(upperCaseA)); // true
Sorting

The Collator type implements Comparator, so it is useful for sorting arrays or lists.

Sets and maps

Use the CollationKey as the type for Set types and the key type for Map implementations.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    CollationKey lower = collator.getCollationKey("a");
    CollationKey upper = collator.getCollationKey("A");

    Object reference = new Object();
    Map<CollationKey, Object> map = new HashMap<>();
    map.put(lower, reference);
    Object retrieved = map.get(upper);
    System.out.println(reference == retrieved); // true

Take care when attempting to use the Collator to create caseless collections.

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    // Bug! Do not do this.
    Set<String> caseless = new TreeSet<>(collator);
    caseless.add("A");
    Set<String> cased = new HashSet<>(Arrays.asList("a"));

    System.out.println(caseless.equals(cased)); // true
    System.out.println(cased.equals(caseless)); // false

This code violates the general contract for equals (equality must be symmetric.)

Collators

The Collator is configured by three things: the locale; the minimum level of difference considered (strength;) and the decomposition mode for normalizing sequences.

The table below demonstrates the effect of a vector of configurations on the equivalence of a sample set of string pairs. The English and Turkish language locales are denoted by en and tr respectively.

Collator Configuration "a" U+0061
"A" U+0041
"ß" U+00df
"ss" U+0073 U+0073
"SS" U+0053 U+0053
"ẞ" U+1e9e
"i" U+0069
"İ" U+0130
"i" U+0069
"ı" U+0131
"A" U+0041
"A" U+ff21
"é" U+00e9
"É" U+0045 U+0301
"é" U+00e9
"é" U+0065 U+0301
en PRIMARY NO_DECOMPOSITION equal equal equal equal equal
en PRIMARY CANONICAL_DECOMPOSITION equal equal equal equal equal
en PRIMARY FULL_DECOMPOSITION equal equal equal equal equal equal
en SECONDARY NO_DECOMPOSITION equal equal equal equal
en SECONDARY CANONICAL_DECOMPOSITION equal equal equal equal
en SECONDARY FULL_DECOMPOSITION equal equal equal equal equal
en TERTIARY NO_DECOMPOSITION equal
en TERTIARY CANONICAL_DECOMPOSITION equal
en TERTIARY FULL_DECOMPOSITION equal equal
en IDENTICAL NO_DECOMPOSITION
en IDENTICAL CANONICAL_DECOMPOSITION equal
en IDENTICAL FULL_DECOMPOSITION equal equal
tr PRIMARY NO_DECOMPOSITION equal equal equal equal
tr PRIMARY CANONICAL_DECOMPOSITION equal equal equal equal
tr PRIMARY FULL_DECOMPOSITION equal equal equal equal equal
tr SECONDARY NO_DECOMPOSITION equal equal equal equal
tr SECONDARY CANONICAL_DECOMPOSITION equal equal equal equal
tr SECONDARY FULL_DECOMPOSITION equal equal equal equal equal
tr TERTIARY NO_DECOMPOSITION equal
tr TERTIARY CANONICAL_DECOMPOSITION equal
tr TERTIARY FULL_DECOMPOSITION equal equal
tr IDENTICAL NO_DECOMPOSITION
tr IDENTICAL CANONICAL_DECOMPOSITION equal
tr IDENTICAL FULL_DECOMPOSITION equal equal

No comments:

Post a Comment

All comments are moderated