Question:
The characters 'É' (E\xcc\x81, a decomposed E plus combining acute accent) and 'É' (\xc3\x89, the precomposed form) consist of different code points. They look identical, yet comparing them for equality fails.
Python can normalize them, though: unicodedata.normalize('NFC', 'É'.decode('utf-8')) == unicodedata.normalize('NFC', 'É'.decode('utf-8')) returns True, and the result prints as É.
Question: is there a way to normalize strings in pure bash*? I've looked into iconv, but as far as I know it can convert to ASCII but not normalize.
*GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
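The mismatch described above is easy to reproduce at the shell level; a minimal sketch, assuming a printf that understands \xHH escapes (as bash's builtin does):

```shell
# Two visually identical strings: decomposed (E + combining acute) vs precomposed É
a=$(printf 'E\xcc\x81')   # bytes 45 cc 81  (U+0045 U+0301)
b=$(printf '\xc3\x89')    # bytes c3 89     (U+00C9)

# The byte-wise comparison fails even though both render as É
[ "$a" = "$b" ] && echo match || echo "no match"   # prints "no match"
```

Bash compares the raw bytes of the variables, so no amount of quoting helps; the strings really are different until one of them is normalized.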
Answer 1:
If you have uconv available, that'll probably do the job:
$ echo -en "E\xcc\x81" | uconv -x Any-NFC | hexdump -C
00000000 c3 89
$ echo -en "\xc3\x89" | uconv -x Any-NFC | hexdump -C
00000000 c3 89
Any-NFD is also available for the decomposed form.
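If uconv is not installed, a small wrapper around Python's unicodedata module (not pure bash, but widely available) can serve as a fallback; the function name norm here is just illustrative:

```shell
# Illustrative helper: NFC-normalize stdin using Python 3's unicodedata
norm() {
  python3 -c 'import sys, unicodedata
data = sys.stdin.buffer.read().decode("utf-8")
sys.stdout.buffer.write(unicodedata.normalize("NFC", data).encode("utf-8"))'
}

echo -en "E\xcc\x81" | norm | hexdump -C   # same c3 89 bytes as the uconv example
```

Reading and writing through the buffer interfaces keeps the pipeline byte-exact regardless of the shell's locale settings.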
来源:https://stackoverflow.com/questions/36425326/string-normalization-in-pure-bash