The simple task of trying to match strings without regard to upper or lower case is surprisingly hard when you try to take into account different language conventions. Things get more complicated when you start to consider accented letters and the legacy of typography.
This post discusses the Java 7 implementation which is documented to suport Unicode 6.0.0.
Note: graphemes will vary based on rendering engines and system fonts.
Case mapping
Here are some examples gleaned from the Unicode specification:
- Mappings depend on alphabet.
In Turkish U+0069
i
maps to U+0130İ
and U+0131ı
maps to U+0049I
. This is known as the "Turkish four-Is" problem. - Mappings are not reversible.
U+00DF
ß
maps toSS
in German (because ß never starts a word) butSS
maps back toss
. The newer capital form U+1E9Eẞ
doesn't make things any simpler - see tailored casing in the specification. - Mappings are context-sensitive.
U+03A3
Σ
maps to U+03C3σ
or U+03C2ς
. It depends on where the letter is in the word.
See also the Character Properties, Case Mappings & Names FAQ.
Collation in Java
The java.text package contains types
for sorting and equivalence testing. The closest thing Java has to a caseless string is the CollationKey
type.
Collator collator = Collator.getInstance(Locale.ENGLISH); collator.setStrength(Collator.PRIMARY); collator.setDecomposition(Collator.NO_DECOMPOSITION); CollationKey lowerCaseA = collator.getCollationKey("a"); CollationKey upperCaseA = collator.getCollationKey("A"); System.out.println(lowerCaseA.equals(upperCaseA)); // true
Sorting
The Collator
type implements Comparator
, so it is useful for sorting
arrays
or lists.
Sets and maps
Use the CollationKey
as the type for Set
types and the key type for Map
implementations.
Collator collator = Collator.getInstance(Locale.ENGLISH); collator.setStrength(Collator.PRIMARY); collator.setDecomposition(Collator.NO_DECOMPOSITION); CollationKey lower = collator.getCollationKey("a"); CollationKey upper = collator.getCollationKey("A"); Object reference = new Object(); Map<CollationKey, Object> map = new HashMap<>(); map.put(lower, reference); Object retrieved = map.get(upper); System.out.println(reference == retrieved); // true
Take care when attempting to use the Collator
to create caseless collections.
Collator collator = Collator.getInstance(Locale.ENGLISH); collator.setStrength(Collator.PRIMARY); collator.setDecomposition(Collator.NO_DECOMPOSITION); // Bug! Do not do this. Set<String> caseless = new TreeSet<>(collator); caseless.add("A"); Set<String> cased = new HashSet<>(Arrays.asList("a")); System.out.println(caseless.equals(cased)); // true System.out.println(cased.equals(caseless)); // false
This code violates the general contract for equals (equality must be symmetric.)
Collators
The Collator
is configured by three things: the locale; the minimum level of difference considered (strength;)
and the decomposition mode for normalizing sequences.
The table below demonstrates the effect of a vector of configurations on the equivalence of a sample set of string pairs. The English and Turkish language locales are denoted by en and tr respectively.
Collator Configuration |
"a" U+0061
"A" U+0041
|
"ß" U+00df
"ss" U+0073 U+0073
|
"SS" U+0053 U+0053
"ẞ" U+1e9e
|
"i" U+0069
"İ" U+0130
|
"i" U+0069
"ı" U+0131
|
"A" U+0041
"A" U+ff21
|
"é" U+00e9
"É" U+0045 U+0301
|
"é" U+00e9
"é" U+0065 U+0301
|
---|---|---|---|---|---|---|---|---|
en PRIMARY NO_DECOMPOSITION | equal | equal | equal | equal | equal | |||
en PRIMARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
en PRIMARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | equal | ||
en SECONDARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
en SECONDARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
en SECONDARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
en TERTIARY NO_DECOMPOSITION | equal | |||||||
en TERTIARY CANONICAL_DECOMPOSITION | equal | |||||||
en TERTIARY FULL_DECOMPOSITION | equal | equal | ||||||
en IDENTICAL NO_DECOMPOSITION | ||||||||
en IDENTICAL CANONICAL_DECOMPOSITION | equal | |||||||
en IDENTICAL FULL_DECOMPOSITION | equal | equal | ||||||
tr PRIMARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
tr PRIMARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
tr PRIMARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
tr SECONDARY NO_DECOMPOSITION | equal | equal | equal | equal | ||||
tr SECONDARY CANONICAL_DECOMPOSITION | equal | equal | equal | equal | ||||
tr SECONDARY FULL_DECOMPOSITION | equal | equal | equal | equal | equal | |||
tr TERTIARY NO_DECOMPOSITION | equal | |||||||
tr TERTIARY CANONICAL_DECOMPOSITION | equal | |||||||
tr TERTIARY FULL_DECOMPOSITION | equal | equal | ||||||
tr IDENTICAL NO_DECOMPOSITION | ||||||||
tr IDENTICAL CANONICAL_DECOMPOSITION | equal | |||||||
tr IDENTICAL FULL_DECOMPOSITION | equal | equal |
No comments:
Post a Comment
All comments are moderated