The simple task of trying to match strings without regard to upper or lower case is surprisingly hard when you try to take into account different language conventions. Things get more complicated when you start to consider accented letters and the legacy of typography.
This post discusses the Java 7 implementation, which is documented to support Unicode 6.0.0.
Note: graphemes will vary based on rendering engines and system fonts.
Case mapping
Here are some examples gleaned from the Unicode specification:
- Mappings depend on alphabet. In Turkish, U+0069 i maps to U+0130 İ and U+0131 ı maps to U+0049 I. This is known as the "Turkish four-Is" problem.
- Mappings are not reversible. U+00DF ß maps to SS in German (because ß never starts a word) but SS maps back to ss. The newer capital form U+1E9E ẞ doesn't make things any simpler - see tailored casing in the specification.
- Mappings are context-sensitive. U+03A3 Σ maps to U+03C3 σ or U+03C2 ς. It depends on where the letter is in the word.
See also the Character Properties, Case Mappings & Names FAQ.
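The three mappings listed above can be observed through String.toUpperCase and String.toLowerCase. The following is a minimal sketch rather than a complete treatment; the class name CaseMappingDemo and its helper methods are made up for illustration, and non-ASCII characters are written as Unicode escapes so the source does not depend on file encoding.

```java
import java.util.Locale;

// Hypothetical demo class; only the String and Locale calls are standard API.
class CaseMappingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Alphabet-dependent: in Turkish, i (U+0069) uppercases to U+0130.
        System.out.println(upper("i", TURKISH).equals("\u0130")); // true
        System.out.println(upper("i", Locale.ENGLISH).equals("I")); // true

        // Not reversible: U+00DF uppercases to "SS", which lowercases to "ss".
        String upperSharpS = upper("\u00DF", Locale.GERMAN);
        System.out.println(upperSharpS); // SS
        System.out.println(lower(upperSharpS, Locale.GERMAN)); // ss

        // Context-sensitive: a word-final capital sigma (U+03A3) lowercases
        // to final sigma (U+03C2) instead of U+03C3.
        String greek = "\u039F\u0394\u039F\u03A3";
        System.out.println(lower(greek, Locale.ROOT).endsWith("\u03C2")); // true
    }
}
```

Passing an explicit Locale to these methods avoids depending on the default locale of the machine running the code.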
Collation in Java
The java.text package contains types for sorting and equivalence testing. The closest thing Java has to a caseless string is the CollationKey type.
Collator collator = Collator.getInstance(Locale.ENGLISH);
collator.setStrength(Collator.PRIMARY);
collator.setDecomposition(Collator.NO_DECOMPOSITION);
CollationKey lowerCaseA = collator.getCollationKey("a");
CollationKey upperCaseA = collator.getCollationKey("A");
System.out.println(lowerCaseA.equals(upperCaseA)); // true
Sorting
The Collator type implements Comparator, so it is useful for sorting arrays or lists.
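As a minimal sketch of the sorting use case, the snippet below passes a Collator to Collections.sort. The class name CollatorSortDemo, the method sortEnglish, and the word list are illustrative only.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class showing a Collator used as a Comparator.
class CollatorSortDemo {
    static List<String> sortEnglish(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> sorted = new ArrayList<>(words); // leave the input untouched
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "Cherry", "apple");

        // Natural String ordering compares UTF-16 code units,
        // so upper case sorts before lower case.
        List<String> natural = new ArrayList<>(words);
        Collections.sort(natural);
        System.out.println(natural); // [Cherry, apple, banana]

        // The collator orders by letter first and by case only as a tie-breaker.
        System.out.println(sortEnglish(words)); // [apple, banana, Cherry]
    }
}
```

The same collator can be passed to Arrays.sort for arrays.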
Sets and maps
Use CollationKey as the element type for Set implementations and the key type for Map implementations.
Collator collator = Collator.getInstance(Locale.ENGLISH);
collator.setStrength(Collator.PRIMARY);
collator.setDecomposition(Collator.NO_DECOMPOSITION);
CollationKey lower = collator.getCollationKey("a");
CollationKey upper = collator.getCollationKey("A");
Object reference = new Object();
Map<CollationKey, Object> map = new HashMap<>();
map.put(lower, reference);
Object retrieved = map.get(upper);
System.out.println(reference == retrieved); // true
Take care when attempting to use the Collator to create caseless collections.
Collator collator = Collator.getInstance(Locale.ENGLISH);
collator.setStrength(Collator.PRIMARY);
collator.setDecomposition(Collator.NO_DECOMPOSITION);
// Bug! Do not do this.
Set<String> caseless = new TreeSet<>(collator);
caseless.add("A");
Set<String> cased = new HashSet<>(Arrays.asList("a"));
System.out.println(caseless.equals(cased)); // true
System.out.println(cased.equals(caseless)); // false
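This code violates the general contract for equals: equality must be symmetric. The TreeSet tests membership with the Collator, while the HashSet uses String.equals and String.hashCode, so the two sets disagree about whether "a" and "A" are the same element.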
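One way to get a caseless collection without the asymmetry is to store CollationKey instances in an ordinary HashSet, so that equals and hashCode agree with each other. The wrapper below is a sketch under that assumption; CaselessSet and its methods are hypothetical names, not part of the JDK.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical wrapper: strings are converted to CollationKeys before storage.
class CaselessSet {
    private final Collator collator;
    private final Set<CollationKey> keys = new HashSet<>();

    CaselessSet(Collator collator) {
        // Clone so later changes to the caller's collator cannot alter lookups.
        this.collator = (Collator) collator.clone();
    }

    static CaselessSet englishPrimary() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.NO_DECOMPOSITION);
        return new CaselessSet(collator);
    }

    boolean add(String s) {
        return keys.add(collator.getCollationKey(s));
    }

    boolean contains(String s) {
        return keys.contains(collator.getCollationKey(s));
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        CaselessSet set = englishPrimary();
        System.out.println(set.add("A"));      // true
        System.out.println(set.add("a"));      // false, same key as "A"
        System.out.println(set.contains("a")); // true
        System.out.println(set.size());        // 1
    }
}
```

Keys produced by differently configured collators should not be mixed in one set; keeping the collator private to the wrapper is one way to guard against that.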
Collators
The Collator is configured by three things: the locale; the minimum level of difference considered (strength); and the decomposition mode for normalizing sequences.
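The effect of strength and decomposition can be checked directly with Collator.equals(String, String). This is a small sketch; the class CollatorConfigDemo and the method equivalent are illustrative names, and only a few combinations are shown.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper that builds a Collator from the three settings
// and tests two strings for equivalence.
class CollatorConfigDemo {
    static boolean equivalent(Locale locale, int strength, int decomposition,
                              String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(decomposition);
        return collator.equals(a, b);
    }

    public static void main(String[] args) {
        // Case is ignored at PRIMARY strength but significant at TERTIARY.
        System.out.println(equivalent(Locale.ENGLISH, Collator.PRIMARY,
                Collator.NO_DECOMPOSITION, "a", "A")); // true
        System.out.println(equivalent(Locale.ENGLISH, Collator.TERTIARY,
                Collator.NO_DECOMPOSITION, "a", "A")); // false

        // Precomposed U+00E9 versus "e" followed by combining acute U+0301.
        String precomposed = "\u00E9";
        String decomposed = "e\u0301";
        System.out.println(equivalent(Locale.ENGLISH, Collator.IDENTICAL,
                Collator.NO_DECOMPOSITION, precomposed, decomposed)); // false
        System.out.println(equivalent(Locale.ENGLISH, Collator.IDENTICAL,
                Collator.CANONICAL_DECOMPOSITION, precomposed, decomposed)); // true
    }
}
```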
The table below demonstrates the effect of a vector of configurations on the equivalence of a sample set of string pairs. The English and Turkish language locales are denoted by en and tr respectively.
| Collator Configuration | "a" U+0061 / "A" U+0041 | "ß" U+00df / "ss" U+0073 U+0073 | "SS" U+0053 U+0053 / "ẞ" U+1e9e | "i" U+0069 / "İ" U+0130 | "i" U+0069 / "ı" U+0131 | "A" U+0041 / "Ａ" U+ff21 | "é" U+00e9 / "É" U+0045 U+0301 | "é" U+00e9 / "é" U+0065 U+0301 |
|---|---|---|---|---|---|---|---|---|
| en PRIMARY NO_DECOMPOSITION | equal | equal | equal | equal | equal | |||
| en PRIMARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
| en PRIMARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | equal | ||
| en SECONDARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
| en SECONDARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
| en SECONDARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
| en TERTIARY NO_DECOMPOSITION | equal | |||||||
| en TERTIARY CANONICAL_DECOMPOSITION | equal | |||||||
| en TERTIARY FULL_DECOMPOSITION | equal | equal | ||||||
| en IDENTICAL NO_DECOMPOSITION | ||||||||
| en IDENTICAL CANONICAL_DECOMPOSITION | equal | |||||||
| en IDENTICAL FULL_DECOMPOSITION | equal | equal | ||||||
| tr PRIMARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
| tr PRIMARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
| tr PRIMARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
| tr SECONDARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
| tr SECONDARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
| tr SECONDARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
| tr TERTIARY NO_DECOMPOSITION | equal | |||||||
| tr TERTIARY CANONICAL_DECOMPOSITION | equal | |||||||
| tr TERTIARY FULL_DECOMPOSITION | equal | equal | ||||||
| tr IDENTICAL NO_DECOMPOSITION | ||||||||
| tr IDENTICAL CANONICAL_DECOMPOSITION | equal | |||||||
| tr IDENTICAL FULL_DECOMPOSITION | equal | equal |