Convert non-ASCII chars from ASCII-8BIT to UTF-8

鱼传尺愫 2021-02-01 12:46

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses UTF-8 by default.

Here is an example of some offending text:



        
4 answers
  •  心在旅途
    2021-02-01 13:44

    There are two possibilities:

    1. The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In that case you just need to tell Ruby that the data is already UTF-8 using force_encoding, which relabels the string without changing its bytes.

      For example, "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data, and "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.

    2. The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary; it isn't a real encoding), then tell Ruby to transcode it.

      For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1'). And this would demonstrate that you can get Ruby to transcode that to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8').
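
The two cases above can be put side by side in a short sketch. The helper name to_utf8 and its fallback_encoding parameter are hypothetical, not part of Ruby or Rails, and the ISO-8859-1 default is only an assumption about what the remote sites might send. The helper first tries relabelling the bytes as UTF-8 and, only if they do not form valid UTF-8, transcodes from the assumed fallback encoding.

```ruby
# Case 1: the bytes are already UTF-8 but tagged as ASCII-8BIT (binary).
# force_encoding only changes the label; the bytes stay the same.
binary = "\xC2\xA9".dup.force_encoding('ASCII-8BIT')
relabelled = binary.dup.force_encoding('UTF-8')

# Case 2: the bytes are in another encoding (ISO-8859-1 here).
# encode rewrites the bytes so that they are valid UTF-8.
latin1 = "\xA9".dup.force_encoding('ISO-8859-1')
transcoded = latin1.encode('UTF-8')

# Hypothetical helper combining both cases. fallback_encoding is an
# assumption about the source data, not something Ruby can detect.
def to_utf8(data, fallback_encoding = 'ISO-8859-1')
  candidate = data.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  data.dup.force_encoding(fallback_encoding).encode('UTF-8')
end
```

With this sketch, both to_utf8(binary) and to_utf8(latin1) should return the copyright symbol as a UTF-8 string. Note that some non-UTF-8 byte sequences also happen to be valid UTF-8, so the valid_encoding? check is a heuristic rather than real encoding detection.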
