Remove lines that contain non-English (non-ASCII) characters from a file

我在风中等你 2021-02-13 02:48

I have a text file containing characters from different languages (Chinese, Latin, etc.).

I want to remove all lines that contain these non-English characters. I want to inc

4 Answers
  • 2021-02-13 03:23

    You can use egrep -v to return only the lines that do not match a pattern, with something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (add more punctuation as needed). The hyphen goes last in the bracket expression so that it is taken literally rather than as a range.

    Hm, thinking about it, a double negation (-v plus the inverted character class) is probably not that readable. Another way might be to match whole lines against ^[ a-zA-Z0-9.,;:'"?!-]*$, without -v.

    You can also just filter for ASCII:

    egrep -v "[^ -~]" foo.txt
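
    To see what this filter does in practice, here is a minimal sketch rather than a definitive recipe. The function name ascii_only_lines and the sample lines are made up for illustration, and LC_ALL=C is an added assumption so that grep compares raw bytes and the range from space to ~ covers exactly the printable ASCII characters. Plain grep is used because the pattern needs no extended syntax.

    ```shell
    # Hypothetical helper: print only the lines of "$1" that consist
    # entirely of printable ASCII (space through tilde).
    ascii_only_lines() {
        # Assumption: LC_ALL=C makes grep work on bytes, so " -~" is 0x20-0x7E.
        LC_ALL=C grep -v '[^ -~]' "$1"
    }

    # Sample file; non-ASCII text is written as explicit UTF-8 octal escapes.
    sample=$(mktemp)
    {
        printf 'plain ascii line\n'
        printf 'caf\303\251 au lait\n'        # UTF-8 bytes for "café"
        printf 'numbers 123, ok?\n'
        printf '\344\270\255\346\226\207\n'   # UTF-8 bytes for two CJK characters
    } > "$sample"

    ascii_only_lines "$sample"
    rm -f "$sample"
    ```

    With this sample, only the two pure-ASCII lines should be printed. A line containing a tab would also be dropped, since a tab falls outside the space-to-tilde range.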
    
  • With GNU grep, which supports Perl-compatible regular expressions through the -P option, you can use:

    grep -P '^[[:ascii:]]+$' file
    
  • 2021-02-13 03:28

    Perl supports an [:ascii:] character class.

    perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
    
  • 2021-02-13 03:41

    You can use Awk, provided you force the use of the C locale:

    LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
    

    Setting the environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. This changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.

    The regex /[^[:alnum:][:space:][:punct:]]/ matches any line containing a non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII characters are selected. Since no action is given, the default action (print) is applied to those lines.

    EDIT: This can also be done with grep:

    LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
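
    As a rough illustration of the locale-based filter, the sketch below wraps the grep form in a function. The name keep_ascii_lines and the sample data are hypothetical. It uses LC_ALL=C rather than LC_CTYPE=C on the assumption that LC_ALL might already be set in the environment, in which case it would override LC_CTYPE.

    ```shell
    # Hypothetical wrapper around the grep form of the filter.
    keep_ascii_lines() {
        # Assumption: LC_ALL=C is used because it takes precedence over LC_CTYPE.
        LC_ALL=C grep -v '[^[:alnum:][:space:][:punct:]]' "$1"
    }

    # Sample file; non-ASCII text is written as explicit UTF-8 octal escapes.
    sample=$(mktemp)
    {
        printf 'tab\tseparated, still ASCII\n'
        printf 'na\303\257ve\n'               # UTF-8 bytes for "naïve"
        printf 'symbols: #$%% ~ |\n'
    } > "$sample"

    keep_ascii_lines "$sample"
    rm -f "$sample"
    ```

    Here the tab-separated line should be kept, because [:space:] includes the tab character, while the line with the accented letter is removed.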
    