What is normalized UTF-8 all about?

没有蜡笔的小新 2020-11-29 15:26

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

Ho

7 Answers
  • 2020-11-29 15:52

    Everything You Never Wanted to Know about Unicode Normalization

    Canonical Normalization

    Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.

    When To Use

    Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

    Canonical normalization comes in 2 forms: NFD and NFC. The two are equivalent in the sense that one can convert between these two forms without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.

    NFD

    NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. uses more space).

    If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

    NFC

    NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
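    As a minimal sketch of the two canonical forms, the following uses Python's standard `unicodedata` module (Python is an assumption here; the question itself concerns ICU and PHP). The variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings render identically but do not compare equal.
raw_equal = composed == decomposed

# NFD expands to base letter + combining mark; NFC recombines them.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Comparing under NFD gives the same answer as comparing under NFC.
equal_under_nfd = (unicodedata.normalize("NFD", composed)
                   == unicodedata.normalize("NFD", decomposed))
equal_under_nfc = (unicodedata.normalize("NFC", composed)
                   == unicodedata.normalize("NFC", decomposed))

print(raw_equal, equal_under_nfd, equal_under_nfc)  # False True True
print(len(nfd), len(nfc))                           # 2 1
```

    The NFD result has two code points and the NFC result has one, which is the space difference described above.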

    Compatibility Normalization

    Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

    Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

    Characters that include formatting information are replaced with ones that do not. For example, the character ⁹ gets converted to 9. Others don't involve formatting differences. For example, the roman numeral character Ⅸ is converted to the regular letters IX.

    Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.
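    A short sketch of this behaviour, assuming Python's standard `unicodedata` module and using U+2079 SUPERSCRIPT NINE and U+2168 ROMAN NUMERAL NINE as sample characters:

```python
import unicodedata

superscript_nine = "\u2079"  # ⁹ SUPERSCRIPT NINE
roman_nine = "\u2168"        # Ⅸ ROMAN NUMERAL NINE

# Compatibility normalization replaces both with ordinary characters.
nfkc_super = unicodedata.normalize("NFKC", superscript_nine)
nfkc_roman = unicodedata.normalize("NFKC", roman_nine)

# Canonical normalization leaves them untouched.
nfc_super = unicodedata.normalize("NFC", superscript_nine)
nfc_roman = unicodedata.normalize("NFC", roman_nine)

print(nfkc_super, nfkc_roman)         # 9 IX
print(nfc_super == superscript_nine)  # True
print(nfc_roman == roman_nine)        # True
```

    Nothing in the output "9" or "IX" records which original character it came from, which is why the transformation cannot be reversed.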

    When to use

    The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

    An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.

    One thing you should probably not do is display the result of applying compatibility normalization to the user.
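    One way to follow both pieces of advice is to normalize only the key used for matching and keep the original text for display. The sketch below assumes Python's `unicodedata`; `search_key` and `find` are hypothetical helpers invented for illustration, not part of any library.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical helper: fold compatibility variants for matching only."""
    return unicodedata.normalize("NFKC", text)

def find(query: str, docs: list[str]) -> list[str]:
    """Return the ORIGINAL documents whose normalized key contains the query."""
    key = search_key(query)
    return [doc for doc in docs if key in search_key(doc)]

documents = ["x\u2079", "chapter \u2168", "plain 9"]

hits = find("9", documents)
# The superscript document matches, and it is returned unmodified.
print(hits == ["x\u2079", "plain 9"])  # True
```

    A real search engine would do far more (case folding, tokenization, collation); this only shows where compatibility normalization would sit.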

    NFKC/NFKD

    Compatibility normalization comes in two forms, NFKD and NFKC. They have the same relationship as NFD and NFC.

    Any string in NFKC is inherently also in NFC, and the same holds for NFKD and NFD. Thus NFKD(x)=NFD(NFKC(x)), and NFKC(x)=NFC(NFKD(x)), etc.
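    These identities can be spot-checked on a few sample strings. The sketch assumes Python's `unicodedata`; `nf` is a hypothetical shorthand, and a handful of samples illustrates the identities without proving them.

```python
import unicodedata

def nf(form: str, text: str) -> str:
    """Hypothetical shorthand for unicodedata.normalize."""
    return unicodedata.normalize(form, text)

samples = ["\u00c7", "C\u0327", "\u2079", "\u2168", "\ufb01", "\u212b", "s\u0307\u0323"]

# NFKD(x) == NFD(NFKC(x)) and NFKC(x) == NFC(NFKD(x))
nfkd_identity = all(nf("NFKD", x) == nf("NFD", nf("NFKC", x)) for x in samples)
nfkc_identity = all(nf("NFKC", x) == nf("NFC", nf("NFKD", x)) for x in samples)

# A string in NFKC is already in NFC; a string in NFKD is already in NFD.
nfkc_is_nfc = all(nf("NFC", nf("NFKC", x)) == nf("NFKC", x) for x in samples)
nfkd_is_nfd = all(nf("NFD", nf("NFKD", x)) == nf("NFKD", x) for x in samples)

print(nfkd_identity, nfkc_identity, nfkc_is_nfc, nfkd_is_nfd)  # True True True True
```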

    Conclusion

    If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.

  • 2020-11-29 15:54

    This is actually fairly simple. Unicode, and therefore UTF-8, has several different representations of the same "character". (I use character in quotes since byte-wise they are different, but practically they are the same). An example is given in the linked document.

    The character "Ç" can be represented as the byte sequence 0xc387. But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7. So you can say that 0xc387 and 0x43cca7 are the same character. This works because 0xcca7 encodes a combining mark (U+0327 COMBINING CEDILLA); that is to say, it takes the character before it (a C here) and modifies it.
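    The byte sequences quoted above can be reproduced with a few lines of Python (an assumed language; only the standard library is used):

```python
import unicodedata

precomposed = "\u00c7"  # Ç as a single code point
combining = "C\u0327"   # C followed by U+0327 COMBINING CEDILLA

precomposed_hex = precomposed.encode("utf-8").hex()
combining_hex = combining.encode("utf-8").hex()
print(precomposed_hex, combining_hex)  # c387 43cca7

# Different bytes, but NFC maps the combining sequence onto the precomposed one.
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed
print(same_after_nfc)  # True
```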

    Now, as far as the difference between canonical equivalence vs compatibility equivalence, we need to look at characters in general.

    There are 2 types of characters, those that convey meaning through the value, and those that take another character and alter it. 9 is a meaningful character. A super-script ⁹ takes that meaning and alters it by presentation. So canonically they have different meanings, but they still represent the base character.

    Canonical equivalence is where the byte sequence is rendering the same character with the same meaning. Compatibility equivalence is when the byte sequence is rendering a different character with the same base meaning (even though it may be altered). The 9 and ⁹ are compatibility equivalent since they both mean "9", but are not canonically equivalent since they don't have the same representation.

  • 2020-11-29 15:59

    Some characters, for example a letter with an accent (say, é) can be represented in two ways - a single code point U+00E9 or the plain letter followed by a combining accent mark U+0065 U+0301. Ordinary normalization will choose one of these to always represent it (the single code point for NFC, the combining form for NFD).

    For characters that could be represented by multiple sequences of base characters and combining marks (say, "s, dot below, dot above" vs putting dot above then dot below or using a base character that already has one of the dots), NFD will also pick one of these (below goes first, as it happens).
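    The reordering can be observed directly. This sketch assumes Python's `unicodedata` and uses U+0323 COMBINING DOT BELOW, U+0307 COMBINING DOT ABOVE and the precomposed U+1E69 as sample input:

```python
import unicodedata

above_then_below = "s\u0307\u0323"  # s, dot above, dot below
below_then_above = "s\u0323\u0307"  # s, dot below, dot above
precomposed = "\u1e69"              # LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE

# All three spellings collapse to one NFD sequence.
nfd_forms = {
    unicodedata.normalize("NFD", s)
    for s in (above_then_below, below_then_above, precomposed)
}
print(nfd_forms == {"s\u0323\u0307"})  # True

# The order follows the canonical combining class: lower values sort first.
class_below = unicodedata.combining("\u0323")
class_above = unicodedata.combining("\u0307")
print(class_below, class_above)  # 220 230
```

    Marks with the same combining class are not reordered relative to each other, since their order can affect rendering.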

    The compatibility decompositions include a number of characters that "shouldn't really" be characters but are because they were used in legacy encodings. Ordinary normalization won't unify these (to preserve round-trip integrity; this isn't an issue for the combining forms because no legacy encoding [except a handful of Vietnamese encodings] used both), but compatibility normalization will. Think of the "㎏" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "ﬁ" ligature in MacRoman.

    See http://unicode.org/reports/tr15/ for more details.

  • 2020-11-29 16:05

    The problem of comparing strings: two strings with content that is equivalent for the purposes of most applications may contain differing character sequences.

    See Unicode's canonical equivalence: when the comparison algorithm is simple (or must be fast), it does not check Unicode equivalence. This problem occurs, for instance, in XML canonical comparison; see http://www.w3.org/TR/xml-c14n

    To avoid this problem, which standard should you use: "expanded UTF8" or "compact UTF8"?
    Use "ç" or "c+◌̧"?

    W3C and others (e.g. for file names) suggest using the composed form as the canonical one (as a mnemonic, think of the C as in "most compact", i.e. shorter strings). So,

    The standard is C! When in doubt, use NFC.

    For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).

    PS: "FORM_C" is the default form in most libraries, e.g. in PHP's Normalizer::isNormalized().


    The term "composition form" (FORM_C) is used both to say that "a string is in the C-canonical form" (the result of an NFC transformation) and to name the transforming algorithm itself. See http://www.macchiato.com/unicode/nfc-faq

    (...) each of the following sequences (the first two being single-character sequences) represent the same character:

    1. U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
    2. U+212B ( Å ) ANGSTROM SIGN
    3. U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

    These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. (...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
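    The three sequences and the two abbreviations from the quote can be sketched in Python with the standard `unicodedata` module. `to_nfc` and `is_nfc` are hypothetical stand-ins for toNFC(S) and isNFC(S), written here for illustration only.

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Hypothetical stand-in for toNFC(S)."""
    return unicodedata.normalize("NFC", s)

def is_nfc(s: str) -> bool:
    """Hypothetical stand-in for isNFC(S): a string is in NFC if NFC leaves it unchanged."""
    return to_nfc(s) == s

sequences = [
    "\u00c5",   # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212b",   # ANGSTROM SIGN
    "A\u030a",  # A + COMBINING RING ABOVE
]

normalized = [to_nfc(s) for s in sequences]
flags = [is_nfc(s) for s in sequences]

print(all(n == "\u00c5" for n in normalized))  # True
print(flags)                                   # [True, False, False]
```

    Only the first sequence is already in NFC; the other two are converted to it.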


    Note: to test the normalization of short strings (pure UTF-8 or XML-entity references), you can use this test/normalize online converter.

  • 2020-11-29 16:07

    Normal forms (of Unicode, not databases) deal primarily (exclusively?) with characters that have diacritical marks. Unicode provides some characters with "built in" diacritical marks, such as U+00C0, "Latin Capital A with Grave". The same character can be created from a "Latin Capital A" (U+0041) with a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same resulting character, a byte-by-byte comparison will show them as being completely different.

    Normalization is an attempt at dealing with that. Normalizing assures (or at least tries to) that all the characters are encoded the same way -- either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From a viewpoint of comparison, it doesn't really matter a whole lot which you choose -- pretty much any normalized string will compare properly with another normalized string.

    In this case, "compatibility" means compatibility with code that assumes that one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms imply that the Unicode consortium considers it preferable to use separate combining diacritical marks. This requires more intelligence to count the actual characters in a string (as well as things like breaking a string intelligently), but is more versatile.

    If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form that makes that true as often as possible.

  • 2020-11-29 16:07

    If two Unicode strings are canonically equivalent, the strings are really the same, only using different Unicode sequences. For example, Ä can be represented either using the single character Ä or a combination of A and ◌̈.

    If the strings are only compatibility equivalent, the strings aren't necessarily the same, but they may be the same in some contexts. E.g. the ligature ﬀ (U+FB00) could be considered the same as ff.

    So, if you are comparing strings you should use canonical equivalence, because compatibility equivalence isn't real equivalence.

    But if you want to sort a set of strings, it might make sense to use compatibility equivalence, as they are nearly identical.
