open-uri returning ASCII-8BIT from webpage encoded in iso-8859

小蘑菇 2021-01-12 07:56

I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.

1 Answer
  • 2021-01-12 08:36
    • ASCII-8BIT is an alias for BINARY: Ruby is telling you the string holds raw bytes with no character encoding attached.
    • open-uri does a funny thing: if the response is smaller than about 10 KB, it hands you a StringIO, and if it's bigger, it hands you a Tempfile. That can be confusing if you're trying to deal with encoding issues.
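
To make the first point concrete, here is a minimal sketch (not part of the original answer) of what an ASCII-8BIT string looks like and how it can be turned into UTF-8 with plain Ruby once you trust the page's claim of ISO-8859-1. The sample bytes are made up for illustration; nothing here touches the network.

```ruby
# Made-up sample: the bytes of "café" as an ISO-8859-1 page would send them.
raw = "caf\xE9".b

# ASCII-8BIT and BINARY are two names for the same encoding object.
same_encoding = raw.encoding == Encoding::BINARY

# force_encoding only relabels the bytes; encode then converts them.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
utf8   = latin1.encode(Encoding::UTF_8)

# Equivalent one-step form: encode(destination, source).
utf8_too = raw.encode('UTF-8', 'ISO-8859-1')
```

This only works when the declared encoding is actually correct; if the page lies about its charset, the result will be valid UTF-8 but the wrong characters.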

    If the files aren't huge, I would recommend manually loading them into strings:

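    require 'uri'
    require 'net/http'
    require 'net/https' # only matters on old Rubies; harmless on newer ones
    
    # url_to_file is the address of the page, as a String
    uri = URI.parse(url_to_file)
    
    http = Net::HTTP.new(uri.host, uri.port)
    if uri.scheme == 'https'
      http.use_ssl = true
      # possibly useful if you see ssl errors, but it turns off certificate checks
      # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
    end
    # fetch the page and keep the response body as a plain String
    body = http.start { |session| session.get(uri.request_uri) }.body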
    

    Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding) to convert the body to UTF-8:

    require 'ensure/encoding'
    utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)
    

    I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com

    Note that you can also say :invalid_characters => :ignore instead of :transcode.

    Also, if you already know the encoding, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
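
One way to "know the encoding" is to read it from the response's Content-Type header. The sketch below is an illustration only, using plain Ruby rather than ensure-encoding; `charset_from` and `to_utf8` are hypothetical helper names, and the ISO-8859-1 fallback is an assumption that suits this particular question, not a general rule.

```ruby
# Hypothetical helper: extract the charset from a Content-Type value such as
# "text/html; charset=ISO-8859-1". Returns nil when no charset is declared.
def charset_from(content_type)
  match = content_type.to_s.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Hypothetical helper: relabel the raw bytes with the declared charset
# (or the fallback), then transcode to UTF-8, replacing anything that
# cannot be converted.
def to_utf8(body, content_type, fallback = 'ISO-8859-1')
  name = charset_from(content_type) || fallback
  encoding =
    begin
      Encoding.find(name)
    rescue ArgumentError
      # Ruby does not know the declared charset; use the fallback instead.
      Encoding.find(fallback)
    end
  body.dup.force_encoding(encoding)
      .encode('UTF-8', invalid: :replace, undef: :replace)
end

# With a Net::HTTP response this could be called as:
#   utf8_body = to_utf8(response.body, response['Content-Type'])
```

The header is only a claim made by the server, so sniffing, as ensure-encoding does with :sniff, remains the safer choice when pages are known to mislabel themselves.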
