I have a text file containing characters from different languages (Chinese, Latin, etc.).
I want to remove all lines that contain these non-English characters. I want to inc
You can use Awk, provided you force the use of the C locale:
LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
The environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. It changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.
The /[^[:alnum:][:space:][:punct:]]/ regex matches any line containing a non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII characters match. Since no action is given, the default action (print) is applied to the matching lines.
EDIT: This can also be done with grep:
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
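As a quick sanity check, here is a minimal sketch that runs both commands against a throwaway file. The sample file and its contents are hypothetical and exist only for illustration. It assumes an awk that supports POSIX character classes (gawk, or a recent mawk) and a grep that treats every byte as a valid character in the C locale (recent GNU grep does). It uses LC_ALL=C rather than LC_CTYPE=C because LC_ALL, if already set in your environment, takes precedence over LC_CTYPE.

```shell
#!/bin/sh
# Hypothetical sample file: two pure-ASCII lines and two lines with non-ASCII text.
sample=$(mktemp)
printf '%s\n' \
  'plain ASCII line, with punctuation!' \
  '你好，世界' \
  'café au lait' \
  'digits 123 and symbols #$%' > "$sample"

# Same awk program as above, with the default action written out explicitly.
awk_out=$(LC_ALL=C awk '! /[^[:alnum:][:space:][:punct:]]/ { print }' "$sample")

# grep equivalent: -v keeps only the lines that do NOT match the pattern.
grep_out=$(LC_ALL=C grep -v '[^[:alnum:][:space:][:punct:]]' "$sample")

printf 'awk:\n%s\n\ngrep:\n%s\n' "$awk_out" "$grep_out"
rm -f "$sample"
```

Both commands should keep only the first and last lines of the sample. Note that the pattern also drops lines containing ASCII control characters other than whitespace, since those belong to none of the three classes.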