How to read only English characters

闹比i 2021-01-21 14:23

I am reading a file that sometimes contains Chinese characters and characters from other non-English languages.

How can I write a regex that only reads English words/letters?


3 answers
  •  面向向阳花
    2021-01-21 14:58

    Sometimes it's useful to use the Iconv library to deal with non-ASCII characters:

    require 'iconv' # removed from the standard library in Ruby 2.0; available there as the iconv gem
    
    utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # !> encoding option isn't portable: TRANSLIT//IGNORE
    utf8_to_ascii_translit = Iconv.new("ASCII//TRANSLIT", "UTF8") # !> encoding option isn't portable: TRANSLIT
    utf8_to_ascii_ignore = Iconv.new("ASCII//IGNORE", "UTF8") # !> encoding option isn't portable: IGNORE
    
    resume = "Résumé"
    utf8_to_latin1.iconv(resume) # => "R\xE9sum\xE9"
    utf8_to_ascii_translit.iconv(resume) # => "R'esum'e"
    utf8_to_ascii_ignore.iconv(resume) # => "Rsum"
    

    Notice that Ruby warns that these options are not portable, meaning they may not behave the same way on every platform's iconv. The "//TRANSLIT" and "//IGNORE" options are also lossy: they can degrade the string being processed, but for our purpose that's OK.

    James Gray wrote a nice article, "Encoding Conversion With iconv", which is useful for understanding what Iconv can do and for dealing with UTF-8 and Unicode characters.
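
    Since the question asks for a regex, and Iconv is no longer bundled with Ruby 2.0 and later, here is a minimal sketch of the same idea using only core String methods. The helper names `strip_non_ascii` and `english_words` are made up for illustration, and the sketch assumes the input is a valid UTF-8 string.

```ruby
# Hypothetical helpers for illustration; not part of the original answer.

# Drop every character that has no ASCII equivalent,
# roughly what the "ASCII//IGNORE" converter above does.
def strip_non_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Collect runs of plain ASCII letters, skipping Chinese
# and accented characters entirely.
def english_words(text)
  text.scan(/[A-Za-z]+/)
end

sample = "Résumé 你好 hello world"
strip_non_ascii(sample) # => "Rsum  hello world"
english_words(sample)   # => ["R", "sum", "hello", "world"]
```

    Note that the regex splits "Résumé" at the accented letters, so whether to strip, transliterate, or skip such words depends on what you count as "English".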
