I have two strings in JavaScript: "_strange_chars_µö¬é@zendesk.com.eml" (f1) and "_strange_chars_µö¬é@zendesk.com.eml" (f2). f1 uses the ö character; f2 uses an o and a diacritic ¨ as a separate character.
f1 is in Normal Form C (NFC, composed) and f2 is in Normal Form D (NFD, decomposed). In general Normal Form C is the most common on Windows and the web, with the Unicode FAQ describing it as “the best form for general text”. Unfortunately the Apple world plumped for Normal Form D in order to be gratuitously different.
The strings are canonically equivalent by the rules of Unicode equivalence.
What comparison can I do that will show these two strings to be "equal"?
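To make the mismatch concrete, here is a small sketch that lists the code points of the two spellings of ö. The single-character strings stand in for the full file names, and `codePoints` is a hypothetical helper introduced only for illustration.

```javascript
// Stand-ins for the differing character in f1 and f2.
const composed = "\u00F6";    // ö as one precomposed code point
const decomposed = "o\u0308"; // o followed by a combining diaeresis

// Hypothetical helper: format each code point of a string as U+XXXX.
const codePoints = (s) =>
  Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

console.log(codePoints(composed));    // [ 'U+00F6' ]
console.log(codePoints(decomposed));  // [ 'U+006F', 'U+0308' ]
console.log(composed === decomposed); // false
```

A plain `===` comparison works on the underlying code units, which is why two strings that render identically still compare as different.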
In general, you convert both strings to one Normal Form of your choosing and then compare them. For example in Python:
>>> import unicodedata
>>> a= u'\u00F6' # ö composed
>>> b= u'o\u0308' # o then combining umlaut
>>> unicodedata.normalize('NFC', a)==unicodedata.normalize('NFC', b)
True
Similarly Java has the Normalizer class, .NET has String.Normalize, and many languages have bindings available to the ICU library, which also offers this feature.
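Unfortunately, JavaScript historically had no native Unicode normalisation ability (String.prototype.normalize only arrived with ECMAScript 2015). Where it is unavailable, this means either: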
doing it yourself, carting around large Unicode data tables to cover it all in JavaScript (see eg here for an example implementation); or
sending it back to the server-side (eg via XMLHttpRequest), where you've got a better-equipped language to do it.
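Newer JavaScript engines (ECMAScript 2015 and later) do provide `String.prototype.normalize`, so where it is available the same comparison can be done client-side. A minimal sketch, assuming such an engine; `canonicallyEqual` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: compare two strings for canonical equivalence
// by converting both to the same normalization form first.
// Assumes String.prototype.normalize exists (ES2015+).
function canonicallyEqual(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

const composed = "\u00F6";    // ö as one precomposed code point
const decomposed = "o\u0308"; // o followed by a combining diaeresis

console.log(composed === decomposed);                // false
console.log(canonicallyEqual(composed, decomposed)); // true
```

Either form works for the comparison as long as both sides use the same one; NFC is the default here only because it matches the Python example.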