Let's take a JavaScript string: "€100". This is going to be sent from a browser input box and stored in a web server's database. The database uses the UTF-8 encoding and the constraint on the column is CHAR(4). Spot the problem?
From the ECMA specification:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
As a UTF-16 string, this data has a length of 4 (the sequence of 16-bit code units is 20AC 0031 0030 0030). When encoded as UTF-8, the same string has a length of 6. A UTF-8 code unit is 8 bits (one byte) wide, and the encoded form is E2 82 AC 31 30 30 (the first three bytes are the euro symbol).
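To see where those three euro-sign bytes come from, here is the three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx, used for code points U+0800 to U+FFFF) applied by hand to U+20AC. This is just an illustrative sketch, not part of the code presented later:

```javascript
// Encode the code point U+20AC (the euro sign) into three UTF-8
// bytes by hand, using the pattern 1110xxxx 10xxxxxx 10xxxxxx.
var cp = 0x20AC;
var b1 = 0xE0 | (cp >> 12);          // high 4 bits   -> 0xE2
var b2 = 0x80 | ((cp >> 6) & 0x3F);  // middle 6 bits -> 0x82
var b3 = 0x80 | (cp & 0x3F);         // low 6 bits    -> 0xAC
console.log([b1, b2, b3].map(function (b) {
  return b.toString(16).toUpperCase();
}).join(" ")); // "E2 82 AC"
```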
It would be best if this problem was caught before the data was sent to the server (you still need to validate input on the server, of course). Many languages and platforms have rich encoding libraries. By comparison, the standard JavaScript library is very lean. The ECMA standard does mandate some encoding functionality for URIs, from which we might be able to hack together a solution. However, it isn't rocket science to calculate the width of the string in UTF-8, as we can demonstrate:
UTF-16 length: 4
UTF-8 length: 6
The difficulty here is in communicating this information to the end user. Few people are going to understand why "$100" is valid while "€100" is not.
One work-around could be to triple the column size in the database (or cut the allowed input by a third). This will allow a degree of user-interface consistency. Such an approach may not always be practical. Note: I'm ignoring cases like combining character sequences.
The Code
The utf8ByteCount function below returns the length of a string when encoded as UTF-8.
/**
 * codePoint - an integer containing a Unicode code point
 * return - the number of bytes required to store the code point in UTF-8
 */
function utf8Len(codePoint) {
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
    throw new Error("Illegal argument: " + codePoint);
  if (codePoint < 0)
    throw new Error("Illegal argument: " + codePoint);
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  if (codePoint <= 0x1FFFFF) return 4;
  if (codePoint <= 0x3FFFFFF) return 5;
  if (codePoint <= 0x7FFFFFFF) return 6;
  throw new Error("Illegal argument: " + codePoint);
}

function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

/**
 * Transforms a UTF-16 surrogate pair into a code point.
 * See RFC 2781.
 */
function toCodepoint(highCodeUnit, lowCodeUnit) {
  if (!isHighSurrogate(highCodeUnit))
    throw new Error("Illegal argument: " + highCodeUnit);
  if (!isLowSurrogate(lowCodeUnit))
    throw new Error("Illegal argument: " + lowCodeUnit);
  highCodeUnit = (0x3FF & highCodeUnit) << 10;
  var u = highCodeUnit | (0x3FF & lowCodeUnit);
  return u + 0x10000;
}

/**
 * Counts the length in bytes of a string when encoded as UTF-8.
 * str - a string
 * return - the length as an integer
 */
function utf8ByteCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var ch = str.charCodeAt(i);
    if (isHighSurrogate(ch)) {
      var high = ch;
      var low = str.charCodeAt(++i);
      count += utf8Len(toCodepoint(high, low));
    } else {
      count += utf8Len(ch);
    }
  }
  return count;
}
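As an aside, modern engines (this wasn't an option when only the lean standard library could be assumed) ship TextEncoder, which produces the actual UTF-8 bytes and so gives us an independent sanity check on the counter above:

```javascript
// Cross-check: TextEncoder encodes a string to its UTF-8 bytes,
// so the length of its output is the UTF-8 byte count.
var bytes = new TextEncoder().encode("\u20AC100"); // "€100"
console.log(bytes.length); // 6
// The first three bytes are the euro symbol:
console.log(Array.from(bytes.slice(0, 3), function (b) {
  return b.toString(16).toUpperCase();
}).join(" ")); // "E2 82 AC"
```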