Unescaping characters in a string with Ruby

Posted by 孤者浪人 on 2019-12-01 09:13:20

I ran into this exact problem the other day. There is a bug in the JSON parser that HTTParty uses (the Crack gem): it matches Unicode escape sequences with a case-sensitive regexp, so because Posterous emits the hex digits A-F in uppercase rather than a-f, Crack doesn't unescape them. I submitted a pull request to fix this.
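
To make the case-sensitivity problem concrete, here is a minimal sketch. The two patterns and the unescape_with helper are made up for illustration and are not taken from Crack's source; they only show how a lowercase-only character class skips an escape such as \u003C.

```ruby
# Illustrative patterns only; these are NOT copied from Crack's source.
LOWERCASE_ONLY = /\\u([0-9a-f]{4})/      # misses \u003C because of the uppercase C
ANY_CASE       = /\\u([0-9a-fA-F]{4})/   # accepts both cases

# Hypothetical helper: replace each match with the character it encodes.
def unescape_with(pattern, text)
  text.gsub(pattern) { [$1.hex].pack("U") }
end

body = '\u003Cp\u003E'                    # single quotes keep the backslashes literal
puts unescape_with(LOWERCASE_ONLY, body)  # prints \u003Cp\u003E (unchanged)
puts unescape_with(ANY_CASE, body)        # prints <p>
```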

In the meantime, HTTParty lets you specify an alternate parser, so you can call ::JSON.parse directly and bypass Crack entirely, like this:

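require 'httparty'
require 'json'

# Custom parser: hand JSON bodies to the json library instead of Crack.
class JsonParser < HTTParty::Parser
  def json
    ::JSON.parse(body)
  end
end

class Posterous
  include HTTParty
  parser ::JsonParser

  #....
end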

I found a solution to this problem in a gist. elskwid had the identical problem and ran the string through a JSON parser:

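# The escaped text has to sit inside a JSON string literal (wrapped in an array here) to parse.
s = ::JSON.parse("[\"\\u003Cp\\u003E\"]").first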

Now, s = "<p>".

You can also use pack:

"a\\u00e4\\u3042".gsub(/\\u(....)/){[$1.hex].pack("U")} # "aäあ"

Or to do the reverse:

"aäあ".gsub(/[^ -~\n]/){"\\u%04x"%$&.ord} # "a\\u00e4\\u3042"

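The two one-liners above can be wrapped into a pair of helpers. This is only a sketch: the method names unescape_unicode and escape_unicode are made up for illustration, and the pattern uses \h{4} (exactly four hex digits) in place of .... so that arbitrary text following \u is left alone. Code points above U+FFFF need more than four hex digits and will not round-trip with this approach.

```ruby
# Hypothetical helper names; the bodies follow the gsub/pack one-liners above.

# Turn literal \uXXXX sequences (backslash, "u", four hex digits) into characters.
def unescape_unicode(text)
  text.gsub(/\\u(\h{4})/) { [$1.hex].pack("U") }
end

# Replace everything outside printable ASCII and newline with \uXXXX.
def escape_unicode(text)
  text.gsub(/[^ -~\n]/) { |ch| format("\\u%04x", ch.ord) }
end

escaped = 'a\u00e4\u3042'            # single quotes keep the backslashes literal
puts unescape_unicode(escaped)       # prints aäあ
puts escape_unicode("a\u00e4\u3042") # prints a\u00e4\u3042
```
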
The doubled backslashes look like what you see when an ordinary string is displayed in a debugger or through inspect.

The string "\u003Cp\u003E" really is "<p>": \u003C is the Unicode escape for < and \u003E is the escape for >.

>> "\u003Cp\u003E"  #=> "<p>"

If you are truly getting the string with doubled backslashes then you could try stripping one of the pair.

As a test, see how long the string is:

>> "\\u003Cp\\u003E".size #=> 13
>> "\u003Cp\u003E".size #=> 3
>> "<p>".size #=> 3
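
Besides checking the length, it can help to compare what puts prints with what inspect shows, since inspect doubles every real backslash. A small sketch (the constant names are only for illustration):

```ruby
ESCAPED = "\\u003Cp\\u003E"  # really contains backslash, u, 0, 0, 3, C, ...
DECODED = "\u003Cp\u003E"    # Ruby decodes the escapes when it parses the literal

puts ESCAPED                 # prints \u003Cp\u003E
puts DECODED                 # prints <p>
puts ESCAPED.inspect         # prints "\\u003Cp\\u003E"
puts ESCAPED.include?("\\")  # prints true
puts DECODED.include?("\\")  # prints false
```

If puts still shows a backslash, the escape sequences are really in the data and need to be unescaped; if only inspect shows them doubled, the string is already fine.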

All of the above was done using Ruby 1.9.2, which is Unicode-aware; 1.8.7 wasn't. For comparison, here's what I get in 1.8.7's IRB:

>> "\u003Cp\u003E" #=> "u003Cpu003E"