How to read only English characters

闹比i 2021-01-21 14:23

I am reading a file that sometimes contains Chinese characters and characters from other non-English languages.

How can I write a regex that reads only English words/letters?

3 answers

面向向阳花 (original poster)

2021-01-21 14:58
Sometimes it's useful to use the Iconv library to deal with non-ASCII characters:
```
require 'iconv'

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # !> encoding option isn't portable: TRANSLIT//IGNORE
utf8_to_ascii_translit = Iconv.new("ASCII//TRANSLIT", "UTF8") # !> encoding option isn't portable: TRANSLIT
utf8_to_ascii_ignore = Iconv.new("ASCII//IGNORE", "UTF8") # !> encoding option isn't portable: IGNORE

resume = "Résumé"
utf8_to_latin1.iconv(resume) # => "R\xE9sum\xE9"
utf8_to_ascii_translit.iconv(resume) # => "R'esum'e"
utf8_to_ascii_ignore.iconv(resume) # => "Rsum"
```
Notice that Ruby warns that the encoding options are not portable: "//TRANSLIT" and "//IGNORE" are not supported by every iconv implementation, so the results can differ between platforms. Both options also degrade the string ("//TRANSLIT" approximates characters it cannot represent, "//IGNORE" drops them), but for our purpose that's OK.
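
On Ruby 2.0 and later, where Iconv is no longer part of the standard library, something close to the "//IGNORE" behaviour can be had from the core `String#encode` method. This is only a minimal sketch: `ascii_only` is a hypothetical helper name, and it assumes the input is valid UTF-8 and that silently dropping every non-ASCII character is acceptable.

```ruby
# Hypothetical helper: drop every character that has no US-ASCII equivalent.
# Assumes `text` is a UTF-8 string; invalid bytes are dropped as well.
def ascii_only(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

resume = "R\u00E9sum\u00E9"   # "Résumé" with precomposed accented letters
ascii_only(resume)            # => "Rsum"
ascii_only("abc \u4E2D\u6587 def") # => "abc  def"
```

Unlike "//TRANSLIT", this does not try to approximate accented letters; they are simply removed.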

James Gray wrote a nice article, "Encoding Conversion With iconv", which is useful for understanding what Iconv can do, as well as for dealing with UTF-8 and Unicode characters.
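
If the goal is to pick out the English words with a regex rather than convert the whole string, an explicit ASCII character class may be enough. A rough sketch, assuming "English" means runs of the letters A-Z and a-z; `english_words` is a hypothetical helper name, not something from Iconv or the question:

```ruby
# Hypothetical helper: return every run of ASCII letters found in `text`.
# [A-Za-z] is used on purpose; [[:alpha:]] would also match letters such as
# "é" or Chinese characters in a UTF-8 string.
def english_words(text)
  text.scan(/[A-Za-z]+/)
end

english_words("Hello \u4E16\u754C, this is Ruby")
# => ["Hello", "this", "is", "Ruby"]
```

Note that accented words get split with this approach ("Résumé" yields "R" and "sum"), so whether to transliterate first depends on the data.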