Using sed, how can a regular expression match Chinese characters?

没有蜡笔的小新 2021-01-03 06:35

I decided to post a question, after spending quite some time and still not figuring out the problem. Also read a bunch of seemingly related posts, none really fit my simple

2 answers
  • 2021-01-03 06:47

    sed apparently doesn't understand \u escape sequences. Bash expands \u inside $'...' only from version 4.2 onward, so bash-3.2 won't either; with a newer bash you could write

    sed $'s/\u4E9B/hello/g'
    

    but you still wouldn't be able to do the range specification.

    However, by translating to UTF-8 by hand, you could arrive at the following extended regular expression which will, I believe, match any UTF-8 sequence for a character in the range U+4E00...U+9FFF:

    (\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])
    

    (But the character ranges will only work if you invoke sed in a single-byte locale, preferably the C locale.)
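
    The hand translation can be double-checked with shell arithmetic. This is only a sketch: utf8_3byte is a hypothetical helper name, and it assumes a shell whose $(( )) accepts hex constants (bash does) and code points in U+0800...U+FFFF, where UTF-8 uses exactly three bytes.

```shell
# Hypothetical helper: print the three UTF-8 bytes of a code point
# in U+0800..U+FFFF (byte layout 1110xxxx 10xxxxxx 10xxxxxx).
utf8_3byte() {
    cp=$(( $1 ))
    printf '%02x %02x %02x\n' \
        $(( 0xE0 | (cp >> 12) )) \
        $(( 0x80 | ((cp >> 6) & 0x3F) )) \
        $(( 0x80 | (cp & 0x3F) ))
}

# Endpoints of the two alternatives in the regular expression.
first=$(utf8_3byte 0x4E00)
last_e4=$(utf8_3byte 0x4FFF)
first_e5=$(utf8_3byte 0x5000)
last=$(utf8_3byte 0x9FFF)
printf '%s\n' "$first" "$last_e4" "$first_e5" "$last"
```

    The four lines printed are the endpoints of the two alternatives: e4 b8 80 through e4 bf bf for the first, and e5 80 80 through e9 bf bf for the second.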

    With GNU sed, you get extended regular expressions if you provide the -r flag (recent versions also accept -E). With Mac OS X I believe you need the -E flag. So you could try:

    LC_ALL=C sed -E \
           $'s/(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/\\1 /g' \
           <test_utf_sed.txt >test_out.txt
    

    (The above lets bash handle the \x escapes. If you leave out the $, then sed will have to handle the \x escapes, and you'll have to change the substitution from \\1 to \1. I don't have a Mac, nor do I have the old version of bash, so I really don't know whether your sed does hex escapes or not; I'm pretty sure that your bash will, but I can't guarantee it.)
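
    If it turns out that neither the shell nor sed expands \x escapes, one fallback is to let printf build the bytes from octal escapes, which POSIX printf accepts in its format string. A sketch under that assumption; the variable names pat and out are only illustrative, and the input is the two characters from the question followed by "ab":

```shell
# Same pattern as above, written with octal escapes:
# \344 = 0xe4, \270-\277 = 0xb8-0xbf, \200-\277 = 0x80-0xbf, \345-\351 = 0xe5-0xe9
pat=$(printf '(\344[\270-\277][\200-\277]|[\345-\351][\200-\277][\200-\277])')

# Feed sed the bytes e4 b8 80 e4 ba 9b followed by "ab".
# LC_ALL=C forces a single-byte locale so the byte ranges are honoured.
out=$(printf '\344\270\200\344\272\233ab\n' | LC_ALL=C sed -E "s/$pat/\\1 /g")
printf '%s\n' "$out"
```

    With a sed that supports -E, this should print the input with a space after each of the two ideographs and "ab" left untouched.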


    By the way, it's not that difficult to get the UTF-8 encodings for those characters; I did it with a little copy-and-paste from the original post. E.g.:

    $ hd <<<"一些"
    00000000  e4 b8 80 e4 ba 9b 0a                              |.......|
    

    It helps to know that every plane 0 ideograph in the range U+4E00...U+9FFF has a three-byte UTF-8 encoding, so 一 is E4 B8 80 and 些 is E4 BA 9B. (The 0A is, of course, the line feed at the end of the line.)
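
    Where hd isn't installed, od gives the same bytes. A small sketch, assuming the script or terminal is in UTF-8 so that the pasted characters are stored as UTF-8; the variable name hex is only illustrative:

```shell
# Dump the bytes of the pasted characters as hex, one byte per column.
printf '%s\n' '一些' | od -An -tx1

# Same bytes with all whitespace stripped, which is easier to compare.
hex=$(printf '%s\n' '一些' | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```

    The second form prints e4b880e4ba9b0a: the two three-byte sequences followed by the line feed.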

  • 2021-01-03 07:06

    Perl has pretty good support for dealing with Unicode. That might be a better bet for your task than sed. This one-liner works like your first sed example:

    perl -CIOED -p -e 's/\p{Script_Extensions=Han}/$& /g' filename
    

    The -CIOED tells perl to do its I/O in utf8. -p runs the given code once for each line of the input file, then prints the result. -e specifies a line of Perl code to run. See the documentation on command-line arguments for more.

    The regular expression uses a Unicode property, \p{Script_Extensions=Han}, rather than a code point range to identify the characters to match.

    You might also want to read the Perl Unicode documentation.
