Why does Rails 3 think xE2x80x89 means â x80 x89

Submitted by 喜夏-厌秋 on 2019-12-08 08:14:57

Question


I have a field scraped from a utf-8 page:

"O’Reilly"

And saved in a yml file:

:name: "O\xE2\x80\x99Reilly"

(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)

However when I load the value into a hash and yield it to a page tagged as utf-8, I get:

OâReilly

I looked up the character â, which is encoded in UTF-16 as \x00E2, and the characters \x80 and \x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.

How do I make Rails interpret a 3-byte UTF-8 sequence as a single character?


Answer 1:


In Ruby 1.8, strings are sequences of bytes rather than characters:

$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"

Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to make sure the correct string appears in the HTML (I assume you want HTML since you mentioned Rails) is to convert non-ASCII characters to HTML entities; in your case to

O&#8217;Reilly

This takes some work, but it should help in cases where you send your HTML as UTF-8 but the end user has set the browser to override that and display Latin-1 or some other restricted charset.
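
The entity conversion described above could be sketched roughly as follows. This is a minimal illustration, not part of the original answer: it assumes Ruby 1.9 or later and a string that really holds valid UTF-8 bytes, and the helper name to_html_entities is hypothetical.

```ruby
# Hypothetical helper: replace every non-ASCII character with a
# numeric HTML entity. Assumes the input bytes are valid UTF-8.
def to_html_entities(str)
  utf8 = str.dup.force_encoding('UTF-8')
  utf8.gsub(/[^[:ascii:]]/) { |ch| format('&#%d;', ch.ord) }
end

name = "O\xE2\x80\x99Reilly"
puts name.bytesize           # 10 bytes
puts name.length             # 8 characters
puts to_html_entities(name)  # O&#8217;Reilly
```

Because the output is pure ASCII, it displays the same way whichever charset the browser decides to use.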




Answer 2:


Ultimately this was caused by loading a syck file (generated by an external script) with psych (in rails). Loading with syck solved the issue:

# in the plain Ruby environment (where the external script ran)
puts YAML::ENGINE.yamler  # => syck

# in Rails
puts YAML::ENGINE.yamler  # => psych

# in the web app: switch to syck for this load, then restore psych
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name]  # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'



Answer 3:


"I assume this means my app is outputting three UTF-16 characters instead of one UTF-8."

It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.

The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with an associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.

If this is the case you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess this.
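
As a rough illustration of the mangling and the force_encoding fix, here is a sketch that assumes Ruby 2.0 or later (for String#b) and simulates the parser's output with a hand-built binary string; the variable names are made up for the example.

```ruby
# Simulate bytes that arrived tagged as ASCII-8BIT (binary).
raw = "O\xE2\x80\x99Reilly".b

# Treating each byte as an ISO-8859-1 character and transcoding to
# UTF-8 turns the 3-byte sequence into three separate characters.
mangled = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts mangled.length   # 10

# Relabelling the same bytes as UTF-8 changes no bytes at all and
# yields the intended single apostrophe.
fixed = raw.dup.force_encoding('UTF-8')
puts fixed.length     # 8

# An already mangled string can be recovered by reversing the step.
recovered = mangled.encode('ISO-8859-1').force_encoding('UTF-8')
puts recovered == fixed   # true
```

Note that force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves, which is why using the wrong one produces the â seen in the question.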



Source: https://stackoverflow.com/questions/6616229/why-does-rails-3-think-xe2x80x89-means-%c3%a2-x80-x89
