I have a text file containing characters from different languages (Chinese, Latin, etc.).
I want to remove all lines that contain these non-English characters. I want to inc
You can use egrep -v to return only the lines not matching the pattern, with something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (include more punctuation as needed; the - must go last in the bracket expression so it is taken literally rather than as a range).
Hm, thinking about it, a double negation (-v and the inverted character class) is probably not that good. Another way might be ^[ a-zA-Z0-9.,;:'"?!-]*$.
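A quick sketch of that second, whitelist-style approach (the sample file name and its contents are made up for illustration; note the - is placed last in the class so it matches a literal hyphen):

```shell
# Build a small sample file: two plain-English lines and one with a CJK character.
printf 'hello world\nyou 好 hello\nnumbers 123\n' > sample.txt

# Keep only lines made up entirely of the allowed characters.
# The '-' sits last in the bracket expression so it is taken literally.
grep -E "^[ a-zA-Z0-9.,;:'\"?!-]*\$" sample.txt
```

Only "hello world" and "numbers 123" survive; the line containing 好 is dropped.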
You can also just filter for ASCII:
egrep -v "[^ -~]" foo.txt
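As a sketch (file contents assumed): [ -~] spans space (0x20) through tilde (0x7E), i.e. every printable ASCII character, so -v drops any line containing something outside that range. Be aware that control characters such as tab also fall outside the range:

```shell
# Three lines: pure ASCII, accented Latin, and one containing a tab.
printf 'plain text\ncafé au lait\ntab\there\n' > foo.txt

# Keep lines consisting solely of printable ASCII (space through tilde).
# Beware: a tab is outside the range, so the third line is dropped as well.
grep -E -v '[^ -~]' foo.txt
```

Only "plain text" is printed.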
With GNU grep, which supports Perl-compatible regular expressions, you can use:
grep -P '^[[:ascii:]]+$' file
Perl supports an [:ascii:] character class.
perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
You can use Awk, provided you force the use of the C locale:
LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
The environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. It changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.
The /[^[:alnum:][:space:][:punct:]]/ regex matches lines containing any non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII character match. Since no action is given, the default action (print) is applied to the matching lines.
EDIT: This can also be done with grep:
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
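A minimal demonstration (the file name is from the answer, the contents are assumed); the same sample works with the awk version above:

```shell
# One pure-ASCII line and one containing an accented character.
printf 'all ascii here!\nnaïve approach\n' > my_file

# With LC_CTYPE=C the POSIX classes cover ASCII only, so any byte
# outside them flags the line, and -v removes it.
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
```

Only "all ascii here!" is printed; the line with ï is removed.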