open-uri returning ASCII-8BIT from webpage encoded in iso-8859

大憨熊 posted on 2019-12-01 04:43:07
  • ASCII-8BIT is an alias for BINARY
  • open-uri does a funny thing: if the response is smaller than about 10 KB (10,240 bytes), it returns a StringIO, and if it's bigger it returns a Tempfile. That can be confusing if you're trying to deal with encoding issues.
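The first point can be checked directly in irb. This is a minimal sketch, not part of the original answer: the sample bytes are made up for illustration, and it only shows that both names resolve to the same encoding and that a binary string is a plain run of bytes.

```ruby
# Look up the encoding by both names; they resolve to the same Encoding object.
same_lookup = Encoding.find('ASCII-8BIT') == Encoding.find('BINARY')

# Illustrative bytes only: "caf" followed by 0xE9 (e-acute in ISO-8859-1).
# String#b returns a copy of the string tagged as binary (ASCII-8BIT).
raw = "caf\xE9".b

puts same_lookup
puts raw.encoding == Encoding::BINARY
puts raw.bytesize
```

A string tagged this way has no character interpretation attached, which is why it has to be relabelled or transcoded before it can be treated as text.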

If the files aren't huge, I would recommend manually loading them into strings:

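require 'uri'
require 'net/http' # also handles HTTPS; the separate 'net/https' require is obsolete

# url_to_file is a placeholder for the URL string you want to fetch
uri = URI.parse url_to_file

http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors, but note that it
  # disables certificate verification, so avoid it in production
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
# body is always a String here, regardless of the response size
body = http.start { |session| session.get uri.request_uri }.body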

Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):

require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)

I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com

Note that you can also pass :invalid_characters => :ignore instead of :transcode.

Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
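If the encoding is known in advance, the same conversion can also be done with the standard library alone. The following is a sketch of that alternative, not of how ensure-encoding works internally; the `body` value is a made-up stand-in for a response body that arrived as ASCII-8BIT, and the page is assumed to really be ISO-8859-1.

```ruby
# Hypothetical stand-in for an HTTP response body tagged as ASCII-8BIT.
# The bytes spell "café crème" in ISO-8859-1.
body = "caf\xE9 cr\xE8me".b

# force_encoding only relabels the bytes (no conversion happens);
# encode then transcodes them from ISO-8859-1 to UTF-8.
utf8_body = body.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8_body
puts utf8_body.encoding
puts utf8_body.valid_encoding?
```

Calling encode('UTF-8') directly on the binary string would raise Encoding::UndefinedConversionError for the non-ASCII bytes, which is why the relabelling step comes first.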
