Transliteration in ruby

假装没事ソ submitted on 2019-11-26 22:24:53

Question


What is the simplest way to transliterate non-English characters in Ruby? That is, a conversion such as:

translit "Gévry"
#=> "Gevry"


Answer 1:


Ruby has an Iconv library in its stdlib, which converts encodings much as the usual iconv command does. (This holds up to Ruby 1.9; from 2.0 onward Iconv is available only as the separate iconv gem.)




Answer 2:


Use the UnicodeUtils gem. This works in Ruby 1.9 and 2.0. Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0.

gem install unicode_utils

Then try this in IRB:

2.0.0p0 :001 > require 'unicode_utils'  #=> true
2.0.0p0 :002 > r = "Résumé"             #=> "Résumé"
2.0.0p0 :003 > r.encoding               #=> #<Encoding:UTF-8>
2.0.0p0 :004 > UnicodeUtils.nfkd(r).gsub(/(\p{Letter})\p{Mark}+/,'\\1')
                                        #=> "Resume"

Now an explanation of how this works!

First you have to normalize the string to NFKD (Normalization Form Compatibility Decomposition; the K stands for compatibility). The character "é", named "latin small letter e with acute" in Unicode, can be represented in two ways:

  • é = U+00E9
  • é = (e = U+0065) + (acute = U+0301)

The first form, a single code point, is the more common one. The second form is the decomposed one: it separates the grapheme (what appears as "é" on your screen) into its two code points, the ASCII "e" and the combining acute accent. Unicode can compose a grapheme from many code points, which is useful in some Asian writing systems.
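To see the two representations for yourself, you can list the code points of each string. This is only a small sketch; code_points_of is a hypothetical helper name, not part of Ruby or UnicodeUtils.

```ruby
# Hypothetical helper: list a string's code points in U+XXXX notation.
def code_points_of(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

composed   = "\u00e9"        # single code point
decomposed = "\u0065\u0301"  # "e" followed by a combining acute accent

puts code_points_of(composed).inspect    # ["U+00E9"]
puts code_points_of(decomposed).inspect  # ["U+0065", "U+0301"]
```

Both strings render as "é", but the second one carries two code points.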

Note that you typically want to normalize your data to one standard form before comparing, sorting, etc. In Ruby the two forms of "é" shown here are NOT equal. In IRB, do this:

> "\u00e9"                   #=> "é"
> "\u0065\u0301"             #=> "é"
> "\u00e9" == "\u0065\u0301" #=> false
> "\u00e9" > "\u0065\u0301"  #=> true
> "\u00e9" >= "f"            #=> true  (composed é > f)
> "\u0065\u0301" > "f"       #=> false (decomposed é < f)

> "Résumé".chars.count       #=> 6
> decomposed = UnicodeUtils.nfkd("Résumé")  
                             #=> "Résumé"
> decomposed.chars.count     #=> 8
> decomposed.length          #=> 8  (length counts code points too)
> decomposed.gsub(/(\p{Letter})\p{Mark}+/,'\\1')
                             #=> "Resume"
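If all you need is a reliable comparison, one option is to normalize both sides to the same form first. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize is built in (it is used here in place of UnicodeUtils); same_text? is a hypothetical helper name.

```ruby
# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "\u00e9"
decomposed = "\u0065\u0301"

puts composed == decomposed            # false: different code point sequences
puts same_text?(composed, decomposed)  # true: identical once both are NFC
```

Any of the normalization forms would do for the comparison, as long as both strings go through the same one.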

Now that we have the string in NFKD form, we can apply a regular expression using the "property name" syntax (\p{property_name}) to match a letter followed by one or more diacritic "marks". By capturing the matched letter, we can use gsub to replace each letter-plus-diacritics sequence with the captured letter throughout the string.

This technique removes diacritical marks from letters; it will not transliterate other scripts, such as Greek or Cyrillic, into equivalent ASCII letters.
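On Ruby 2.2 or later the same idea can probably be done without any gem, since String#unicode_normalize is built in. The following is a minimal sketch under that assumption. strip_diacritics is a hypothetical helper name, and unlike the regular expression above it deletes every combining mark, not only marks that follow a letter.

```ruby
# Hypothetical helper: strip diacritics with core Ruby only (2.2+).
def strip_diacritics(str)
  # Decompose into base characters plus combining marks, then drop the marks.
  str.unicode_normalize(:nfkd).gsub(/\p{Mark}/, "")
end

puts strip_diacritics("Résumé")  # Resume
puts strip_diacritics("Gévry")   # Gevry

# Other scripts lose their accents but are not turned into ASCII.
puts strip_diacritics("Ελληνικά").ascii_only?  # false
```

The last line illustrates the limitation mentioned above: the Greek word keeps its Greek letters.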




Answer 3:


Try taking a look at this script from TechniConseils which replaces accented characters in a string. Example of usage:

"Gévry".removeaccents #=> Gevry

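The script itself is not reproduced here, so the following is not the TechniConseils code, only a rough sketch of how a lookup-table approach could work. ACCENT_MAP and remove_accents are hypothetical names, the table is deliberately incomplete, and unicode_normalize assumes Ruby 2.2 or later.

```ruby
# Hypothetical mapping, deliberately incomplete; extend it as needed.
ACCENT_MAP = {
  "é" => "e", "è" => "e", "ê" => "e", "ë" => "e",
  "à" => "a", "â" => "a", "ä" => "a",
  "ô" => "o", "ö" => "o", "ø" => "o",
  "ù" => "u", "û" => "u", "ü" => "u",
  "ç" => "c", "ß" => "ss", "æ" => "ae"
}.freeze

# Hypothetical helper: replace mapped characters, leave everything else alone.
def remove_accents(str)
  # Compose first so that "e" + combining accent matches the single-character keys.
  str.unicode_normalize(:nfc).gsub(/[^[:ascii:]]/) { |char| ACCENT_MAP.fetch(char, char) }
end

puts remove_accents("Gévry")   # Gevry
puts remove_accents("Straße")  # Strasse
```

A table like this can handle characters such as "ß" or "æ" that have no decomposition into a base letter plus a mark, at the cost of having to maintain the table by hand.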

Source: https://stackoverflow.com/questions/1726404/transliteration-in-ruby
