String normalization in pure bash

Submitted by 穿精又带淫゛_ on 2019-12-11 10:48:46

Question


The characters 'É' (E\xcc\x81, an 'E' followed by a combining acute accent) and 'É' (\xc3\x89, the precomposed character) have different code point sequences. They look identical, yet testing them for equality reports a mismatch.
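The mismatch is easy to reproduce in bash itself; this sketch builds the two encodings with octal escapes (\314\201 and \303\211 are the octal forms of \xcc\x81 and \xc3\x89) and compares them:

```shell
# Two visually identical renderings of É with different byte sequences
a=$(printf 'E\314\201')   # U+0045 U+0301: 'E' + combining acute (decomposed)
b=$(printf '\303\211')    # U+00C9: precomposed É
if [ "$a" = "$b" ]; then echo same; else echo different; fi   # prints "different"
```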

Python can normalize them, though (Python 2 syntax): unicodedata.normalize('NFC', 'É'.decode('utf-8')) == unicodedata.normalize('NFC', 'É'.decode('utf-8')) returns True, and the result prints as É.
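That snippet relies on Python 2's str.decode; a Python 3 equivalent can be run from the shell (a sketch, assuming python3 is on the PATH):

```shell
# NFC-normalize the decomposed form and compare it to the precomposed form
python3 -c 'import unicodedata
d = "E\u0301"             # decomposed: E + combining acute accent
c = "\u00c9"              # precomposed É
print(unicodedata.normalize("NFC", d) == c)'   # prints True
```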

Question: is there a way to normalize strings in pure bash*? I've looked into iconv, but as far as I know it can convert to ASCII, not normalize.

*GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)


Answer 1:


If you have uconv available, that'll probably do the job:

$ echo -en "E\xcc\x81" | uconv -x Any-NFC | hexdump -C
00000000  c3 89
$ echo -en "\xc3\x89" | uconv -x Any-NFC | hexdump -C
00000000  c3 89

Any-NFD is also available for the decomposed form.
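Running the precomposed form through the decomposing transform illustrates the round trip (a sketch, assuming the same ICU uconv build as above):

```shell
# NFD splits É back into 'E' (0x45) plus the combining acute (0xcc 0x81)
$ echo -en "\xc3\x89" | uconv -x Any-NFD | hexdump -C
00000000  45 cc 81
```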



Source: https://stackoverflow.com/questions/36425326/string-normalization-in-pure-bash
