Ruby 1.9.x replace sets of characters with specific cleaned up characters in a string

笑着哭i, submitted on 2019-12-04 15:40:07

I'll make it easy for you to implement:

#encoding: UTF-8
t = 'ŠšÐŽžÀÁÂÃÄAÆAÇÈÉÊËÌÎÑNÒOÓOÔOÕOÖOØOUÚUUÜUÝYÞBßSàaáaâäaaæaçcèéêëìîðñòóôõöùûýýþÿƒ'
fallback = { 
  'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
  'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
  'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
  'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
  'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
  'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
  'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f'
  }

p t.encode('us-ascii', :fallback => fallback)
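
One caveat: when :fallback is a Hash, a character that is missing from the table still raises Encoding::UndefinedConversionError. The sketch below shows one possible way around that, assuming you want a '?' placeholder for anything unmapped. The names FALLBACK_TABLE and to_ascii are hypothetical, and the table is shortened for brevity.

```ruby
# encoding: utf-8

# Shortened table for illustration only; use the full table in practice.
FALLBACK_TABLE = { 'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'Ž'=>'Z', 'ž'=>'z' }

# Hypothetical helper: :fallback also accepts a Proc, so unmapped
# characters can get a default ('?' is an assumption) instead of raising.
def to_ascii(str)
  str.encode('us-ascii', :fallback => lambda { |ch| FALLBACK_TABLE.fetch(ch, '?') })
end

p to_ascii('ŠšŽžÐ')   # => "SsZzDj"
p to_ascii('Š€')      # => "S?"
```

Whether a placeholder, an empty string, or an exception is the right response to an unmapped character depends on what the cleaned string is used for.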

In Ruby 1.9.3 you can use the :fallback option with encode:

"ŠšŽžÐ".encode('us-ascii', :fallback => { [your character table here] })
=> "SsZzDj"

It's also possible to do it with gsub as it accepts a conversion table as a hash argument in 1.9.x:

"ŠšŽžÐ".gsub(/[ŠšŽžÐ]/, [your character table here])
=> "SsZzDj"

Or better yet (by @steenslag):

character_table = [your table here]
regexp_keys     = Regexp.union(character_table.keys) 
"ŠšŽžÐ".gsub(regexp_keys, character_table)
=> "SsZzDj"
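
For reference, a self-contained version of the Regexp.union approach might look like the following. It is only a sketch with a shortened table; CHARACTER_TABLE, TABLE_PATTERN and transliterate are illustrative names, not part of the answer above.

```ruby
# encoding: utf-8

# Shortened table for illustration only; substitute the full table.
CHARACTER_TABLE = { 'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'Ž'=>'Z', 'ž'=>'z' }

# Build the pattern once from the keys so it always matches the table.
TABLE_PATTERN = Regexp.union(CHARACTER_TABLE.keys)

# Hypothetical helper: each match is looked up in the hash. Characters
# outside the table are never matched, so they are left as they are.
def transliterate(str)
  str.gsub(TABLE_PATTERN, CHARACTER_TABLE)
end

p transliterate('ŠšŽžÐ')    # => "SsZzDj"
p transliterate('Života è') # => "Zivota è" ('è' is not in the short table)
```

Unlike encode, this leaves the string in its original encoding and silently passes through anything the table does not cover.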

This sort of character conversion is called transliteration, which is a useful search term if you want to look for more solutions. There are many Ruby libraries that support transliteration, but none of the ones I tested covered your character set completely.

This should do what you want: it translates the characters found in the lookup hash and leaves all others as they are:

# encoding: utf-8
lookup = {'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
        'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
        'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
        'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
        'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
        'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
        'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f'}

clean_genre = entry["genre"].chars.to_a.map { |x|
  if lookup.has_key?(x)
    lookup[x]
  else
    x
  end
}.join

For example, this:

'aŠšŽž'.chars.to_a.map { |x|
  if lookup.has_key?(x)
    lookup[x]
  else
    x
  end
}.join

gives you 'aSsZz'.

Or move the block logic into the lookup table itself (thanks to steenslag for simplifying the default proc solution!):

lookup.default_proc = proc { |hash, key| key }

Then the call would look as follows:

puts 'aŠšŽž'.chars.to_a.map { |x| lookup[x] }.join

Or even better (thanks again to steenslag for pointing this out):

puts 'aŠšŽž'.gsub(/./) { |x| lookup[x] }
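
If you would rather not change the hash's default behaviour (setting default_proc affects every other piece of code that reads from lookup), Hash#fetch with a default value gives the same result. A minimal sketch, assuming a shortened table and the hypothetical names LOOKUP and clean:

```ruby
# encoding: utf-8

# Shortened table for illustration only; frozen so it cannot be modified.
LOOKUP = { 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z' }.freeze

# fetch(x, x) returns the mapped value, or x itself when there is no mapping.
def clean(str)
  str.gsub(/./) { |x| LOOKUP.fetch(x, x) }
end

puts clean('aŠšŽž')   # prints aSsZz
```

Note that /./ does not match newlines, so they are simply skipped by gsub and remain in the result unchanged.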