Searching for non-ASCII characters

难免孤独 2021-01-24 10:24

I have a file, a.out, which contains a number of lines. Each line is one character only, either the Unicode character U+2013 or a lowercase letter a-z.

3 answers
  • 2021-01-24 10:52

    I recommend avoiding dodgy grep -P implementations and using the real thing. This works:

    perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...
    

    Where:

    • The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

    • $. is the current record (line) number.

    • $_ is the current line.

    • \P{ASCII} matches any code point that is not ASCII.

  • 2021-01-24 10:58

    A comment on "How Do I grep For all non-ASCII Characters in UNIX" gives the answer:

    "Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want."

    That implies that the UTF-8 encoding of U+2013 (0xe2, 0x80, 0x93) is not treated by grep as a single printable character outside the given range; grep sees three separate bytes.
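
    The three-byte encoding can be checked directly from a shell. This is a minimal sketch, assuming a POSIX printf, od and tr; the variable name dash_bytes is illustrative.

```shell
# Write U+2013 as its three UTF-8 bytes. Octal escapes are used because
# POSIX printf does not guarantee support for \x escapes.
# od -An -tx1 dumps each byte as two hex digits without an offset column;
# tr strips the spaces and the trailing newline.
dash_bytes=$(printf '\342\200\223' | od -An -tx1 | tr -d ' \n')
echo "$dash_bytes"   # prints e28093
```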

    The GNU grep manual's description of -P does not mention Unicode or UTF-8. Rather, it says "Interpret the pattern as a Perl regular expression." (This does not mean that the result is identical to Perl, only that some of the backslash escapes are similar.)

    Perl itself can be told to use UTF-8 encoding. However, the examples using Perl in "Filtering invalid utf8" do not use that feature. Instead, the expressions (like those in the problematic grep) test only individual bytes, not complete characters.

  • 2021-01-24 11:02

    gawk can help with this problem.

    Here is the awk one-liner:

     awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}
                   {for(i=1;i<=NF;i++)if(!($i in ord))print $i}' file
    

    Below is a test with gawk:

    kent$  cat f
    abcd
    +ß
    s+äö
    ö--我
    中文
    
    kent$  awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}{for(i=1;i<=NF;i++)if(!($i in ord))print $i}' f
    ß
    ä
    ö
    ö
    我
    中
    文
    