Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

我与影子孤独终老i 提交于 2019-11-29 04:22:58

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "ä" 
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https', 
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')

There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:

string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # Norrlandsvägen

Whereas the following code will actually correctly encode your contents:

string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')

Here's an example running in irb:

irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>

The above answer was spot on. Specifically this point here:

There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding.

In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:

results = File.read(file)
results.encoding
 => #<Encoding:UTF-8> 
 results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8

You get an invalid byte sequence error because the characters in different encodings are represented by different byte lengths. Consequently, you would need to specify the encoding to the File API. Think of it like force_encoding:

results = File.read(file, encoding: "iso-8859-1")

So everything is good right? No, not if you want to start parsing the iso-8859-1 string with UTF-8 character encodings:

results = File.read(file, encoding: "iso-8859-1")
results.each do |line|
  puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8

Why this error? Because '¬' is represented as UTF-8. You are using a UTF-8 character sequence against an ISO-8859-1 string. They are incompatible encodings. Consequently, after you read the File as a ISO-8859-1, then you can ask Ruby to encode that ISO-8859-1 into a UTF-8. And now you will be working with UTF-8 strings and thus no problems:

results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
  puts line.split('¬')
end

Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!