String normalization in pure bash

Submitted by 穿精又带淫゛_ on 2019-12-11 10:48:46

Question


The characters 'É' (E\xcc\x81, an 'E' followed by a combining acute accent) and 'É' (\xc3\x89, the precomposed character) have different code point sequences. They look identical, yet testing them for equality reports a mismatch.
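The mismatch is easy to reproduce in bash itself; this sketch builds the two encodings with octal escapes (\314\201 and \303\211 are the octal forms of \xcc\x81 and \xc3\x89) and compares them:

```shell
# Two visually identical renderings of É with different byte sequences
a=$(printf 'E\314\201')   # U+0045 U+0301: 'E' + combining acute (decomposed)
b=$(printf '\303\211')    # U+00C9: precomposed É
if [ "$a" = "$b" ]; then echo same; else echo different; fi   # prints "different"
```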

Python can normalize them, though (Python 2 syntax): unicodedata.normalize('NFC', 'É'.decode('utf-8')) == unicodedata.normalize('NFC', 'É'.decode('utf-8')) returns True, and the result prints as É.
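That snippet relies on Python 2's str.decode; a Python 3 equivalent can be run from the shell (a sketch, assuming python3 is on the PATH):

```shell
# NFC-normalize the decomposed form and compare it to the precomposed form
python3 -c 'import unicodedata
d = "E\u0301"             # decomposed: E + combining acute accent
c = "\u00c9"              # precomposed É
print(unicodedata.normalize("NFC", d) == c)'   # prints True
```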

Question: is there a way to normalize strings in pure bash*? I've looked into iconv, but as far as I know it can convert to ASCII, not normalize.

*GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)


Answer 1:


If you have uconv available, that'll probably do the job:

$ echo -en "E\xcc\x81" | uconv -x Any-NFC | hexdump -C
00000000  c3 89
$ echo -en "\xc3\x89" | uconv -x Any-NFC | hexdump -C
00000000  c3 89

Any-NFD is also available for the decomposed form.
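Running the precomposed form through the decomposing transform illustrates the round trip (a sketch, assuming the same ICU uconv build as above):

```shell
# NFD splits É back into 'E' (0x45) plus the combining acute (0xcc 0x81)
$ echo -en "\xc3\x89" | uconv -x Any-NFD | hexdump -C
00000000  45 cc 81
```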



Source: https://stackoverflow.com/questions/36425326/string-normalization-in-pure-bash
