I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following
Because Net::HTTP does not handle encoding correctly; see http://bugs.ruby-lang.org/issues/2567.

You can parse response['content-type'], which contains the charset, instead of scanning the whole response.body. Then use force_encoding() to set the right encoding, e.g. response.body.force_encoding("UTF-8") if the site is served in UTF-8.
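For illustration, here is a minimal sketch of that approach, assuming the server actually sends a charset parameter in the Content-Type header (fetch_with_encoding is a made-up helper name, not part of any library):

require 'net/http'
require 'uri'

def fetch_with_encoding(url)
  response = Net::HTTP.get_response(URI.parse(url))
  # e.g. "text/html; charset=ISO-8859-1" -> "ISO-8859-1"
  charset = response['content-type'].to_s[/charset=([^;\s"']+)/i, 1]
  if charset
    begin
      # force_encoding only relabels the bytes; it does not transcode them.
      response.body.force_encoding(charset)
    rescue ArgumentError
      # Unknown charset name; leave the body tagged as-is.
    end
  end
  response.body
end

body = fetch_with_encoding('http://example.com/')
puts body.encoding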
I found the following code to work for me:
def document
  # Build the Nokogiri document lazily, converting the body to UTF-8
  # when the source encoding is known.
  if @document.nil? && response
    @document = if document_encoding
      Nokogiri::HTML(response.body.force_encoding(document_encoding).encode('utf-8'), nil, 'utf-8')
    else
      Nokogiri::HTML(response.body)
    end
  end
  @document
end

def document_encoding
  return @document_encoding if @document_encoding
  # First try the charset parameter of the Content-Type header.
  response.type_params.each_pair do |k, v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    # Reading the <meta> tag through Nokogiri would call #document, which
    # calls #document_encoding again, so a regex on the raw body is used
    # instead:
    #
    # document.css("meta[http-equiv=Content-Type]").each do |n|
    #   attr = n.get_attribute("content")
    #   @document_encoding = attr.slice(/charset=[a-z1-9\-_]+/i).split("=")[1].upcase if attr
    # end
    @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["'](.*)["']/i &&
                         $1 =~ /charset=(.+)/i && $1.upcase
  end
  @document_encoding
end
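To see what the force_encoding(...).encode('utf-8') line in document actually does, here is a self-contained sketch with a hand-made Latin-1 body (no network involved):

# Simulate what Net::HTTP hands back: raw bytes tagged as binary.
latin1_body = "caf\xE9".b               # \xE9 is "é" in ISO-8859-1
puts latin1_body.encoding               # => ASCII-8BIT

# force_encoding relabels the bytes; encode then really transcodes them.
utf8_body = latin1_body.force_encoding('ISO-8859-1').encode('utf-8')
puts utf8_body                          # => "café"
puts utf8_body.encoding                 # => UTF-8

The order matters: calling encode('utf-8') directly on the binary-tagged string would fail on the non-ASCII byte, so the string has to be relabeled with its real encoding first.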