When to use Unicode Normalization Forms NFC and NFD?

后端 未结 2 1163
温柔的废话
温柔的废话 2021-02-05 05:24

The Unicode Normalization FAQ includes the following paragraph:

Programs should always compare canonical-equivalent Unicode strings as equal ... The Unico

2条回答
  •  无人及你
    2021-02-05 05:33

    The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.

    In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.

    For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.

    There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.

    It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.

提交回复
热议问题