Convert a unicode string to characters in Ruby?

Question

I have the following string:

l\u0092issue

How can I convert it to UTF-8 characters?

I have tried this:

1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
 => "l\u0092issue"

Answer 1:

You seem to have got your encodings mixed up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which is a good introduction to this kind of problem. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.

In this specific case, the string l\u0092issue means that the second character is the character with the Unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), a C1 control code rather than a printable character, so it is very unlikely to be what was intended.
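To confirm which codepoint the escape stands for, you can look at the character's ordinal. This is a minimal sketch using only core String methods; the variable names are illustrative.

```ruby
s = "l\u0092issue"

# The second character (index 1) is the one written as \u0092.
second = s[1]

puts format('U+%04X', second.ord)  # U+0092
puts s.length                      # 7 (one character, not six escape symbols)
```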

However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don’t speak French.

What I suspect has happened is that your program received the string l’issue encoded in CP-1252 but assumed it was encoded in ISO-8859-1 (the two are closely related, but ISO-8859-1 maps byte 0x92 to the control codepoint U+0092), and then re-encoded it to UTF-8, leaving you with the string you now have.
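The suspected mix-up can be reproduced in a few lines. This sketch assumes the original input really was the bytes l, 0x92, issue, which the question does not confirm; it interprets the same bytes under both encodings and converts each result to UTF-8.

```ruby
# Raw bytes as they might have arrived (assumed for illustration).
raw = "l\x92issue".b

# Interpreted as Windows-1252, byte 0x92 is U+2019 (right single quotation mark).
as_cp1252 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

# Interpreted as ISO-8859-1, byte 0x92 is the C1 control U+0092.
as_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts as_cp1252 == "l\u2019issue"  # true
puts as_latin1 == "l\u0092issue"  # true
```

The second result is exactly the string from the question, which is consistent with the wrong-source-encoding explanation.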

The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
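One way to be explicit at the boundary is to state the external encoding when reading input. The sketch below assumes the data comes from a file known to be Windows-1252; the temporary file and its name are hypothetical stand-ins for whatever your real input is.

```ruby
require 'tempfile'

# Create a sample file containing Windows-1252 bytes (stand-in for real input).
file = Tempfile.new('cp1252_sample')
file.binmode
file.write("l\x92issue".b)
file.close

# "r:Windows-1252:UTF-8" means: the file is Windows-1252, give me UTF-8 strings.
text = File.open(file.path, 'r:Windows-1252:UTF-8', &:read)
file.unlink

puts text.encoding               # UTF-8
puts text == "l\u2019issue"      # true
```

Declaring the source encoding once, where the data enters, means no later repair step is needed.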

To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:

2.1.0 :001 > s = "l\u0092issue"
 => "l\u0092issue" 
2.1.0 :002 > s = s.encode('iso-8859-1')
 => "l\x92issue" 
2.1.0 :003 > s.force_encoding('cp1252')
 => "l\x92issue" 
2.1.0 :004 > s.encode('utf-8')
 => "l’issue"

This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.

Answer 2:

That string is already encoded as UTF-8 (unless you changed the original string's encoding). Ruby is just showing you the escape sequence when you inspect the string, which is what IRB does there. \u0092 is the escape sequence for this character.

Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.
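You can check this without relying on the terminal at all. The following sketch, using only core String methods, shows that the string is valid UTF-8 and that the escape is just how inspect displays one two-byte character.

```ruby
s = "l\u0092issue"

puts s.encoding            # UTF-8
puts s.valid_encoding?     # true

# U+0092 is stored as the two UTF-8 bytes 0xC2 0x92.
puts s.bytes[1, 2].map { |b| format('0x%02X', b) }.join(' ')  # 0xC2 0x92

# inspect (used by IRB) displays the character as an escape sequence.
puts s.inspect
```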

Source: https://stackoverflow.com/questions/21171782/convert-a-unicode-string-to-characters-in-ruby

Tags

ruby

string