What is normalized UTF-8 all about?

没有蜡笔的小新 2020-11-29 15:26

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

Ho

7 Answers
  • 2020-11-29 16:07

    Whether canonical equivalence or compatibility equivalence is more relevant to you depends on your application. The ASCII way of thinking about string comparisons roughly maps to canonical equivalence, but Unicode represents a lot of languages. I don't think it is safe to assume that Unicode encodes every language in a way that lets you treat its text just like Western European ASCII.
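
    To make canonical equivalence concrete, here is a minimal sketch in Python using the standard `unicodedata` module (the question mentions ICU and PHP; Python is only used here for illustration). The helper name `canonically_equal` and the sample strings are my own, not part of any library.

```python
import unicodedata

# "é" as one precomposed code point vs. "e" followed by a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = composed == decomposed                   # compares code points as stored
nfc_equal = canonically_equal(composed, decomposed)  # compares canonical forms
print(raw_equal, nfc_equal)
```

    Both strings render the same way, but only the normalized comparison treats them as equal. Normalizing to NFD on both sides would work just as well; what matters is using the same form for both operands.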

    Figures 1 and 2 provide good examples of the two types of equivalence. Under compatibility equivalence, it looks like the same number in subscript and superscript form would compare equal. But I'm not sure that solves the same problem as the cursive Arabic forms or the rotated characters.
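
    The subscript/superscript case can be checked with a short sketch, again in Python with the standard `unicodedata` module. The helper `fold` and the sample strings are assumptions made for illustration; whether this folding is desirable depends on the application.

```python
import unicodedata

superscript = "x\u00b2"  # "x" followed by SUPERSCRIPT TWO
subscript = "x\u2082"    # "x" followed by SUBSCRIPT TWO
plain = "x2"

def fold(text: str, form: str) -> str:
    """Hypothetical helper: normalize text to the given Unicode form."""
    return unicodedata.normalize(form, text)

samples = (superscript, subscript, plain)

# Canonical normalization keeps the three strings distinct.
nfc_forms = {fold(s, "NFC") for s in samples}

# Compatibility normalization maps both digit variants to the plain digit.
nfkc_forms = {fold(s, "NFKC") for s in samples}

print(len(nfc_forms), len(nfkc_forms))
```

    Under NFKC all three strings collapse to the same value, which helps a search index but discards the distinction between a squared term and a subscripted one.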

    The hard truth of Unicode text processing is that you have to think deeply about your application's text processing requirements, and then address them as well as you can with the available tools. That doesn't directly address your question, but a more detailed answer would require linguistic experts for each of the languages you expect to support.
