How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?

你的背包 2020-12-31 12:02

I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

  1. Retrieve an HTML page (via net/http)
  2. Create a Nokog
2 Answers
  • 2020-12-31 12:36

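    This happens because Net::HTTP does not set the encoding of the response body for you: in Ruby 1.9 the body typically comes back tagged as ASCII-8BIT (binary), whatever charset the server declared. See http://bugs.ruby-lang.org/issues/2567

    Instead of scanning the whole response.body, you can read the charset from response['content-type'].

    Then use force_encoding() to tag the body with the right encoding. Note that force_encoding() only relabels the bytes; call encode() afterwards if you need to convert them to another encoding.

    response.body.force_encoding("UTF-8") if the site is served in UTF-8.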
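
A minimal sketch of that approach, assuming you already have the raw body and the Content-Type header value as strings. The helper names charset_from_content_type and body_to_utf8 are hypothetical, as is the UTF-8 default used when no charset is declared; the sample bytes stand in for a real response so the snippet runs without a network call.

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type header value.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*["']?([^\s;"']+)/i)
  match && match[1]
end

# Hypothetical helper: tag the raw bytes with the declared charset, then transcode to UTF-8.
# The default charset is an assumption, not something the header guarantees.
def body_to_utf8(raw_body, content_type, default = "UTF-8")
  charset = charset_from_content_type(content_type) || default
  # dup so the caller's string is not relabelled in place
  raw_body.dup.force_encoding(charset).encode("UTF-8")
end

# Stand-in for response.body: "café" in ISO-8859-1, as a binary (ASCII-8BIT) string.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")
text = body_to_utf8(raw, "text/html; charset=ISO-8859-1")

puts text.encoding  # prints UTF-8
```

With a real request you would pass response.body and response['content-type'] instead of the sample values. Both force_encoding and encode can raise if the declared charset is unknown or the bytes are invalid for it, so production code would likely want a rescue around the conversion.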

  • 2020-12-31 12:45

    The following code works for me now:

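    require 'nokogiri'

    def document
      # Parse the response body once and memoize the resulting Nokogiri document.
      if @document.nil? && response
        @document = if document_encoding
                      # Tag the raw bytes with the detected charset, then transcode to UTF-8.
                      Nokogiri::HTML(response.body.force_encoding(document_encoding).encode('utf-8'), nil, 'utf-8')
                    else
                      # No charset found: fall back to Nokogiri's default handling.
                      Nokogiri::HTML(response.body)
                    end
      end
      @document
    end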
    
    def document_encoding
      return @document_encoding if @document_encoding
      # Prefer the charset parameter of the HTTP Content-Type header.
      response.type_params.each_pair do |k,v|
        @document_encoding = v.upcase if k =~ /charset/i
      end
      unless @document_encoding
        #document.css("meta[http-equiv=Content-Type]").each do |n|
        #  attr = n.get_attribute("content")
        #  @document_encoding = attr.slice(/charset=[a-z1-9\-_]+/i).split("=")[1].upcase if attr
        #end
        # Otherwise fall back to the charset declared in a <meta http-equiv="Content-Type"> tag.
        # The captures stop at the closing quote so trailing markup is not mistaken for the charset.
        @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["']([^"']*)["']/i && $1 =~ /charset=([\w-]+)/i && $1.upcase
      end
      @document_encoding
    end
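
To try the meta-tag fallback on its own, here is a small standalone sketch. The function name charset_from_meta and the sample HTML are hypothetical; the pattern follows the one used in document_encoding, with the captures limited to the quoted attribute value.

```ruby
# Hypothetical standalone version of the <meta> fallback, for experimenting outside the class.
def charset_from_meta(html)
  # Grab the content attribute of a <meta http-equiv="Content-Type"> tag, if any.
  content = html[/<meta[^>]*http-equiv=["']Content-Type["'][^>]*content=["']([^"']*)["']/i, 1]
  # Then pick the charset parameter out of that attribute value.
  charset = content && content[/charset=([\w-]+)/i, 1]
  charset && charset.upcase
end

html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1256"></head></html>'

puts charset_from_meta(html)                 # prints WINDOWS-1256
puts charset_from_meta("<p>x</p>").inspect   # prints nil
```

Like the original, this only matches tags where http-equiv comes before content, so pages that order the attributes differently fall through to the no-charset branch.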
    