The simple task of trying to match strings without regard to upper or lower case is surprisingly hard when you try to take into account different language conventions. Things get more complicated when you start to consider accented letters and the legacy of typography.
This post discusses the Java 7 implementation, which is documented to support Unicode 6.0.0.
Note: the glyphs displayed will vary with the rendering engine and system fonts.
Case mapping
Here are some examples gleaned from the Unicode specification:
Mappings depend on alphabet. In Turkish, U+0069 i maps to U+0130 İ and U+0131 ı maps to U+0049 I. This is known as the "Turkish four Is" problem.
Mappings are not reversible. U+00DF ß maps to SS in German (because ß never starts a word), but SS maps back to ss. The newer capital form U+1E9E ẞ doesn't make things any simpler; see tailored casing in the specification.
Mappings are context-sensitive. U+03A3 Σ maps to U+03C3 σ or U+03C2 ς. It depends on where the letter is in the word.
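The three mappings above can be observed through String.toUpperCase and String.toLowerCase. The following is a minimal sketch rather than a definitive test; the class name CaseMappingDemo and the TURKISH constant are made up for illustration, and the outputs in the comments assume a standard JDK that applies the Unicode special casing rules.

```java
import java.util.Locale;

public class CaseMappingDemo {
    // Hypothetical constant: the Turkish language locale (forLanguageTag exists since Java 7).
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // Alphabet-dependent: Turkish pairs i with İ and ı with I.
        System.out.println("i".toUpperCase(TURKISH));        // İ (U+0130)
        System.out.println("I".toLowerCase(TURKISH));        // ı (U+0131)
        System.out.println("i".toUpperCase(Locale.ENGLISH)); // I (U+0049)

        // Not reversible: ß upper-cases to SS, which lower-cases to ss.
        String upper = "\u00DF".toUpperCase(Locale.GERMAN);
        System.out.println(upper);                            // SS
        System.out.println(upper.toLowerCase(Locale.GERMAN)); // ss

        // Context-sensitive: a capital sigma at the end of a word
        // lower-cases to the final form ς (U+03C2) instead of σ (U+03C3).
        String odos = "\u039F\u0394\u039F\u03A3"; // ΟΔΟΣ
        System.out.println(odos.toLowerCase(Locale.ROOT));    // οδος
    }
}
```

Passing an explicit Locale matters here: the no-argument overloads use the default locale, so the same code can behave differently on a machine configured for Turkish.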
See also the Character Properties, Case Mappings & Names FAQ.
Collation in Java
The java.text package contains types for sorting and equivalence testing. The closest thing Java has to a caseless string is the CollationKey type.
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);
    CollationKey lowerCaseA = collator.getCollationKey("a");
    CollationKey upperCaseA = collator.getCollationKey("A");
    System.out.println(lowerCaseA.equals(upperCaseA)); // true
Sorting
The Collator type implements Comparator, so it is useful for sorting arrays or lists.
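As a rough illustration of the point above, the sketch below sorts a list with a PRIMARY-strength English Collator and compares the result with natural String ordering. The class name CollatorSortDemo and the helper sortCaseless are hypothetical names chosen for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollatorSortDemo {
    /** Hypothetical helper: returns a copy sorted by an English PRIMARY-strength Collator. */
    static List<String> sortCaseless(List<String> input) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<>(input);
        // Collator is a Comparator<Object>, so it can order a List<String>.
        Collections.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "Cherry", "apple");

        // Natural ordering compares char values, so upper case sorts first.
        List<String> natural = new ArrayList<>(words);
        Collections.sort(natural);
        System.out.println(natural);             // [Cherry, apple, banana]

        // The Collator ignores the case difference at PRIMARY strength.
        System.out.println(sortCaseless(words)); // [apple, banana, Cherry]
    }
}
```

The same Collator instance can be passed to Arrays.sort when the data is held in a String array.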
Sets and maps
Use CollationKey as the element type for Set implementations and as the key type for Map implementations.
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);
    CollationKey lower = collator.getCollationKey("a");
    CollationKey upper = collator.getCollationKey("A");
    Object reference = new Object();
    Map<CollationKey, Object> map = new HashMap<>();
    map.put(lower, reference);
    Object retrieved = map.get(upper);
    System.out.println(reference == retrieved); // true
Take care when attempting to use the Collator to create caseless collections.
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);
    // Bug! Do not do this.
    Set<String> caseless = new TreeSet<>(collator);
    caseless.add("A");
    Set<String> cased = new HashSet<>(Arrays.asList("a"));
    System.out.println(caseless.equals(cased)); // true
    System.out.println(cased.equals(caseless)); // false
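This code violates the general contract for equals: equality must be symmetric. The TreeSet tests membership with the Collator, while the HashSet uses String.equals and hashCode, so the two sets disagree about whether each contains the other's elements.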
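One way to avoid the asymmetry is to store CollationKey instances instead of strings, so that every set involved agrees on what equality means. This is only a sketch under that assumption; the class name CaselessKeySetDemo and the helper toKeySet are hypothetical, not part of java.text.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CaselessKeySetDemo {
    /** Hypothetical helper: converts strings to collation keys using one Collator. */
    static Set<CollationKey> toKeySet(Collator collator, String... values) {
        Set<CollationKey> keys = new HashSet<>();
        for (String value : values) {
            keys.add(collator.getCollationKey(value));
        }
        return keys;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.NO_DECOMPOSITION);

        Set<CollationKey> upper = toKeySet(collator, "A");
        Set<CollationKey> lower = toKeySet(collator, "a");

        // Both sets rely on CollationKey.equals and hashCode, so equality is symmetric.
        System.out.println(upper.equals(lower)); // true
        System.out.println(lower.equals(upper)); // true
    }
}
```

The keys should all come from the same Collator; the CollationKey documentation states that keys generated by different Collator objects cannot be meaningfully compared. CollationKey.getSourceString recovers the original string when it is needed for display.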
Collators
The Collator is configured by three things: the locale; the minimum level of difference considered (the strength); and the decomposition mode for normalizing sequences.
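To make the three settings concrete, here is a small sketch that builds a Collator from a locale, a strength and a decomposition mode, then tests one pair of strings at each strength. The class name CollatorConfigDemo and the helper equivalent are hypothetical; only the "a"/"A" pair is exercised, and no output is asserted for the other pairs discussed in this post.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorConfigDemo {
    /** Hypothetical helper: tests two strings for equivalence under one configuration. */
    static boolean equivalent(Locale locale, int strength, int decomposition,
                              String left, String right) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(decomposition);
        return collator.compare(left, right) == 0;
    }

    public static void main(String[] args) {
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
        };
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY", "IDENTICAL"};

        // Case is a tertiary difference, so "a" and "A" stop being
        // equivalent once the strength reaches TERTIARY.
        for (int i = 0; i < strengths.length; i++) {
            boolean equal = equivalent(Locale.ENGLISH, strengths[i],
                    Collator.NO_DECOMPOSITION, "a", "A");
            System.out.println("en " + names[i] + " NO_DECOMPOSITION: "
                    + (equal ? "equal" : "different"));
        }
    }
}
```

Swapping in Locale.forLanguageTag("tr") or a different decomposition constant lets the same helper probe other configurations.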
The table below demonstrates the effect of a vector of configurations on the equivalence of a sample set of string pairs. The English and Turkish language locales are denoted by en and tr respectively.
Collator Configuration | string pair columns, left to right:
"a" U+0061 / "A" U+0041
"ß" U+00df / "ss" U+0073 U+0073
"SS" U+0053 U+0053 / "ẞ" U+1e9e
"i" U+0069 / "İ" U+0130
"i" U+0069 / "ı" U+0131
"A" U+0041 / "Ａ" U+ff21
"é" U+00e9 / "É" U+0045 U+0301
"é" U+00e9 / "é" U+0065 U+0301
en PRIMARY NO_DECOMPOSITION  equal  equal  equal  equal  equal  
en PRIMARY CANONICAL_DECOMPOSITION  equal  equal  equal  equal  equal  
en PRIMARY FULL_DECOMPOSITION  equal  equal  equal  equal  equal  equal  
en SECONDARY NO_DECOMPOSITION  equal  equal  equal  equal  
en SECONDARY CANONICAL_DECOMPOSITION  equal  equal  equal  equal  
en SECONDARY FULL_DECOMPOSITION  equal  equal  equal  equal  equal  
en TERTIARY NO_DECOMPOSITION  equal  
en TERTIARY CANONICAL_DECOMPOSITION  equal  
en TERTIARY FULL_DECOMPOSITION  equal  equal  
en IDENTICAL NO_DECOMPOSITION  
en IDENTICAL CANONICAL_DECOMPOSITION  equal  
en IDENTICAL FULL_DECOMPOSITION  equal  equal  
tr PRIMARY NO_DECOMPOSITION  equal  equal  equal  equal  
tr PRIMARY CANONICAL_DECOMPOSITION  equal  equal  equal  equal  
tr PRIMARY FULL_DECOMPOSITION  equal  equal  equal  equal  equal  
tr SECONDARY NO_DECOMPOSITION  equal  equal  equal  equal  
tr SECONDARY CANONICAL_DECOMPOSITION  equal  equal  equal  equal  
tr SECONDARY FULL_DECOMPOSITION  equal  equal  equal  equal  equal  
tr TERTIARY NO_DECOMPOSITION  equal  
tr TERTIARY CANONICAL_DECOMPOSITION  equal  
tr TERTIARY FULL_DECOMPOSITION  equal  equal  
tr IDENTICAL NO_DECOMPOSITION  
tr IDENTICAL CANONICAL_DECOMPOSITION  equal  
tr IDENTICAL FULL_DECOMPOSITION  equal  equal 