I have a text file containing characters from different languages (Chinese, Latin, etc.).
I want to remove all lines that contain these non-English characters. I want to inc
You can use egrep -v to return only the lines not matching the pattern, with something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (include more punctuation as needed; the - must go last in the bracket expression so it is taken literally rather than as a range).
Hm, thinking about it, a double negation (-v and the inverted character class) is probably not that good. Another way might be ^[ a-zA-Z0-9.,;:'"?!-]*$.
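A quick sketch of that second, whitelist-style approach (the sample file name and its contents are made up for illustration; note the - is placed last in the class so it matches a literal hyphen):

```shell
# Build a small sample file: two plain-English lines and one with a CJK character.
printf 'hello world\nyou 好 hello\nnumbers 123\n' > sample.txt

# Keep only lines made up entirely of the allowed characters.
# The '-' sits last in the bracket expression so it is taken literally.
grep -E "^[ a-zA-Z0-9.,;:'\"?!-]*\$" sample.txt
```

Only "hello world" and "numbers 123" survive; the line containing 好 is dropped.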
You can also just filter for ASCII:
egrep -v "[^ -~]" foo.txt
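As a sketch (file contents assumed): [ -~] spans space (0x20) through tilde (0x7E), i.e. every printable ASCII character, so -v drops any line containing something outside that range. Be aware that control characters such as tab also fall outside the range:

```shell
# Three lines: pure ASCII, accented Latin, and one containing a tab.
printf 'plain text\ncafé au lait\ntab\there\n' > foo.txt

# Keep lines consisting solely of printable ASCII (space through tilde).
# Beware: a tab is outside the range, so the third line is dropped as well.
grep -E -v '[^ -~]' foo.txt
```

Only "plain text" is printed.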
With GNU grep, which supports Perl-compatible regular expressions, you can use:
grep -P '^[[:ascii:]]+$' file
Perl supports an [:ascii:] character class.
perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
You can use Awk, provided you force the use of the C locale:
LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
The environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. It changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.
The /[^[:alnum:][:space:][:punct:]]/ regex matches lines containing any non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII character match. Since no action is given, the default action (print) is applied to the matching lines.
EDIT: This can also be done with grep:
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
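A minimal demonstration (the file name is from the answer, the contents are assumed); the same sample works with the awk version above:

```shell
# One pure-ASCII line and one containing an accented character.
printf 'all ascii here!\nnaïve approach\n' > my_file

# With LC_CTYPE=C the POSIX classes cover ASCII only, so any byte
# outside them flags the line, and -v removes it.
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
```

Only "all ascii here!" is printed; the line with ï is removed.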