How to replace Unicode characters with ASCII

被撕碎了的回忆 2021-02-15 18:35

I have the following command to replace Unicode characters with ASCII ones.

sed -i 's/Ã/A/g'

The problem is that Ã isn't recognized.

4 answers
  • 2021-02-15 18:43

    You can use iconv to transliterate to ASCII (the //translit suffix is supported by glibc and GNU libiconv):

    iconv -f utf-8 -t ascii//translit
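
A minimal sketch of running this end to end. The helper name to_ascii and the sample text are illustrative and not part of the original answer. It assumes an iconv that supports //TRANSLIT (glibc or GNU libiconv). The replacement chosen for each character can vary with the iconv implementation and the current locale, so no exact output is shown.

```shell
# to_ascii (hypothetical helper): read UTF-8 on stdin, write an ASCII
# approximation on stdout. Assumes an iconv that supports //TRANSLIT.
to_ascii() {
    iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Sample input. iconv may exit non-zero when it has to substitute a
# placeholder, so report that instead of aborting the script.
printf '%s\n' "Ã l'école" | to_ascii ||
    echo 'iconv could not transliterate every character' >&2
```

To convert a file rather than a pipe, redirect it into the helper, for example `to_ascii < input.txt > output.txt` (both file names are placeholders).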
    
  • 2021-02-15 19:01

    It is possible to use hex values in "sed".

    echo "Ã" | hexdump -C
    00000000  c3 83 0a                                          |...|
    00000003
    

    OK, that character is the two-byte UTF-8 sequence "c3 83" (the trailing "0a" is the newline added by echo). Let's replace it with the single byte "A":

    echo "Ã" |sed 's/\xc3\x83/A/g'
    A
    

    Explanation: \x tells "sed" that a two-digit hex byte value follows. This escape is a GNU sed extension.
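
The same replacement can be applied in place to a file, as in the question. This is a sketch that assumes GNU sed, since both the \xHH escape and -i without a backup suffix are GNU extensions. The scratch file and its contents are made up for illustration.

```shell
# Assumes GNU sed (\xHH escapes and -i without a suffix are GNU extensions).
tmp=$(mktemp)                               # hypothetical scratch file

# Write two "Ã" characters as raw UTF-8 bytes (octal 303 203 = hex c3 83).
printf 'caf\303\203 \303\203\n' > "$tmp"

# LC_ALL=C makes sed match byte by byte regardless of the current locale.
LC_ALL=C sed -i 's/\xc3\x83/A/g' "$tmp"

result=$(cat "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"                     # prints: cafA A
```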

  • 2021-02-15 19:03

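    Try setting LANG=C so that sed works on bytes, then delete every byte in the non-ASCII range (0x80-0xFF). Note that this removes the characters instead of replacing them, and that LC_ALL, if set in your environment, overrides LANG: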
    echo "hi ☠ there ☠" | LANG=C sed "s/[\x80-\xFF]//g"
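
A sketch of the same byte-stripping idea wrapped in reusable functions. The function names are hypothetical. The sed variant assumes GNU sed for the \x escapes. The tr variant is a swapped-in alternative that avoids those escapes, not something the answer itself proposes.

```shell
# Both helpers are hypothetical names; they read stdin and drop every
# byte outside 0x00-0x7F.

# GNU sed variant: LC_ALL=C forces byte-wise matching even when LC_ALL is
# already set in the environment (LANG alone would be overridden by it).
strip_non_ascii_sed() {
    LC_ALL=C sed 's/[\x80-\xFF]//g'
}

# tr variant: -c complements the set and -d deletes, so only ASCII survives.
strip_non_ascii_tr() {
    LC_ALL=C tr -cd '\000-\177'
}

echo "hi ☠ there ☠" | strip_non_ascii_sed
echo "hi ☠ there ☠" | strip_non_ascii_tr
```

Each call prints "hi  there " with a doubled space and a trailing space, because the skull characters are deleted while the spaces around them are kept.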

  • 2021-02-15 19:04

    There is also uconv, from ICU.

    Examples:

    • uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;": removes accents
    • uconv -x "::Latin; ::Latin-ASCII;": transliterates Latin to ASCII
    • uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;": transliterates Latin to ASCII and removes any remaining code points above 0x7F
    • ...

    echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" gives: A l'ecole
