Transliteration in Ruby

Question

What is the simplest way to transliterate non-English characters in Ruby? That is, a conversion such as:

translit "Gévry"
#=> "Gevry"

Answer 1:

Ruby (through 1.9) has an Iconv library in its stdlib, which converts between encodings in much the same way as the usual iconv command.

Answer 2:

Use the UnicodeUtils gem, which works in Ruby 1.9 and 2.0. Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0.

gem install unicode_utils

Then try this in IRB:

2.0.0p0 :001 > require 'unicode_utils'  #=> true
2.0.0p0 :002 > r = "Résumé"             #=> "Résumé"
2.0.0p0 :003 > r.encoding               #=> #<Encoding:UTF-8>
2.0.0p0 :004 > UnicodeUtils.nfkd(r).gsub(/(\p{Letter})\p{Mark}+/,'\\1')
                                        #=> "Resume"

Now an explanation of how this works!

First you have to normalize the string to NFKD (Normalization Form KD, i.e. Compatibility Decomposition). The "é" character, named LATIN SMALL LETTER E WITH ACUTE in Unicode, can be represented in two ways:

é = U+00E9
é = (e = U+0065) + (acute = U+0301)

The first form, a single precomposed code point, is the most common. The second form is the decomposed format, which separates the grapheme (what appears as "é" on your screen) into two code points: the base letter, ASCII "e", and the combining acute accent mark. Unicode can compose a single grapheme from many code points, which is useful in some Asian writing systems.
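To make the two representations concrete, here is a small sketch that prints the code points and UTF-8 byte counts of each form. The helper name codepoint_labels is made up for this illustration; it relies only on String#codepoints and String#bytes.

```ruby
# codepoint_labels is a hypothetical helper, used only for this illustration.
# It lists the code points of a string in the usual U+XXXX notation.
def codepoint_labels(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

composed   = "\u00e9"   # single precomposed code point
decomposed = "e\u0301"  # "e" followed by the combining acute accent

p codepoint_labels(composed)    # ["U+00E9"]
p codepoint_labels(decomposed)  # ["U+0065", "U+0301"]

# The UTF-8 byte sequences differ as well.
p composed.bytes.length    # 2
p decomposed.bytes.length  # 3
```

Both strings render as "é", but they are different sequences of code points and of bytes.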

Note that you typically want to normalize your data to one standard form for comparison, sorting, etc. In Ruby, the two forms of "é" shown here are NOT equal. In IRB, do this:

> "\u00e9"                   #=> "é"
> "\u0065\u0301"             #=> "é"
> "\u00e9" == "\u0065\u0301" #=> false
> "\u00e9" > "\u0065\u0301"  #=> true
> "\u00e9" >= "f"            #=> true  (composed é > f)
> "\u0065\u0301" > "f"       #=> false (decomposed é < f)

> "Résumé".chars.count       #=> 6
> decomposed = UnicodeUtils.nfkd("Résumé")  
                             #=> "Résumé"
> decomposed.chars.count     #=> 8
> decomposed.length          #=> 8  (code points, not graphemes)
> decomposed.gsub(/(\p{Letter})\p{Mark}+/,'\\1')
                             #=> "Resume"
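As a sketch of the "normalize before comparing" advice: on Ruby 2.2 and later, String#unicode_normalize is built in, so no gem is needed for this part. The helper name same_text? is made up for this example, and the choice of NFC as the common form is an assumption; any single form works as long as both sides use it.

```ruby
# same_text? is a hypothetical helper, used only for this illustration.
# Assumes Ruby 2.2+, where String#unicode_normalize is available.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "\u00e9"   # precomposed é
decomposed = "e\u0301"  # e + combining acute accent

p composed == decomposed             # false (different code points)
p same_text?(composed, decomposed)   # true  (equal once both are in NFC)
p decomposed.unicode_normalize(:nfc).length  # 1
```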

Now that we have the string in NFKD format, we can apply a regular expression using the "property name" syntax (\p{property_name}) to match a letter followed by one or more diacritic "marks". By capturing the matching letter, we can use gsub to replace each letter-plus-diacritics sequence with the captured letter throughout the string.

This technique removes diacritic marks from letters that decompose into a base letter plus combining marks. It will not transliterate other scripts, such as Greek or Cyrillic, into equivalent ASCII letters.
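For newer Rubies, the same approach and its limits can be sketched with the standard library alone. This assumes Ruby 2.2+, where String#unicode_normalize stands in for UnicodeUtils.nfkd; the method name strip_diacritics is made up for this example.

```ruby
# strip_diacritics is a hypothetical helper name.
# Assumes Ruby 2.2+ (String#unicode_normalize); on 1.9/2.0 use
# UnicodeUtils.nfkd(text) as shown earlier in this answer.
def strip_diacritics(text)
  # Decompose, then replace each letter-plus-marks run with the bare letter.
  text.unicode_normalize(:nfkd).gsub(/(\p{Letter})\p{Mark}+/, '\\1')
end

p strip_diacritics("Résumé")  # "Resume"
p strip_diacritics("Gévry")   # "Gevry"

# "ø" has no decomposition into a base letter plus a mark, so it stays.
p strip_diacritics("Søren")   # "Søren"

# Cyrillic letters are not converted to Latin ones.
p strip_diacritics("Москва")  # "Москва"
```

If you need "ø" => "o" or Cyrillic => Latin, you need a real transliteration table on top of this.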

Answer 3:

Try taking a look at this script from TechniConseils, which replaces accented characters in a string. Example of usage:

"Gévry".removeaccents #=> "Gevry"

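The script itself is not reproduced here. As a rough idea of what a table-based replacement can look like, here is a minimal sketch using String#tr. This is NOT the TechniConseils code: the constant names, the method name remove_accents_sketch, and the short character table are all made up for illustration.

```ruby
# Hypothetical stand-in for a table-based approach; not the linked script.
# Each character in ACCENTED_CHARS maps to the character at the same
# position in PLAIN_CHARS.
ACCENTED_CHARS = "àâäéèêëîïôöùûüç"
PLAIN_CHARS    = "aaaeeeeiioouuuc"

def remove_accents_sketch(text)
  text.tr(ACCENTED_CHARS, PLAIN_CHARS)
end

p remove_accents_sketch("Gévry")  # "Gevry"
```

A table like this only handles the characters you list, and it expects precomposed input (for example U+00E9 rather than "e" plus a combining accent).
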
Source: https://stackoverflow.com/questions/1726404/transliteration-in-ruby

Tags

ruby

transliteration