Nokogiri, open-uri, and Unicode Characters

故里飘歌 2020-11-30 01:56

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these?

8 answers
  • 2020-11-30 02:21

    I was having the same problem and the Iconv approach wasn't working. Nokogiri::HTML(...) is a shortcut for Nokogiri::HTML.parse(thing, url, encoding, options), so the encoding can be passed as the third argument.

    So, you just need to do:

    doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')

    and it'll convert the page encoding properly to utf-8. You'll see Ragù instead of Rag\303\271.

  • 2020-11-30 02:21

    Just to add a cross-reference, this SO page gives some related information:

    How to make Nokogiri transparently return un/encoded Html entities untouched?

  • 2020-11-30 02:25

    Try setting the encoding option of Nokogiri, like so:

    require 'open-uri'
    require 'nokogiri'
    doc = Nokogiri::HTML(open(link))
    doc.encoding = 'utf-8'
    title = doc.at_css("title")
    
  • 2020-11-30 02:26

    Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.

    Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:

    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'
    
    doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1]
    puts h52.text, h52.text.encoding
    #=> Genealogà a de Jesucristo
    #=> UTF-8
    

    We can see that this is not the fault of open-uri:

    html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    gene = html.read[/Gene\S+/]
    puts gene, gene.encoding
    #=> Genealogía
    #=> UTF-8
    

    This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:

    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'
    
    html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    doc = Nokogiri::HTML(html.read)
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1].text
    puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
    #=> Genealogía de Jesucristo
    #=> UTF-8
    #=> true
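One plausible explanation for the garbled heading shown earlier (an assumption; the answer does not confirm it) is that the UTF-8 bytes were decoded as ISO-8859-1 somewhere between open-uri and the parser. The sketch below reproduces that kind of mojibake with Ruby's String API alone, assuming Ruby 2.0 or later, where source files default to UTF-8. The helper names misdecode_as_latin1 and repair_mojibake are hypothetical.

```ruby
# Simulate the bug: treat UTF-8 bytes as ISO-8859-1, then transcode to UTF-8.
# Every byte of a multi-byte character becomes a separate character.
def misdecode_as_latin1(utf8_text)
  utf8_text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Reverse it: transcode back to the original bytes and relabel them as UTF-8.
def repair_mojibake(garbled_text)
  garbled_text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

original = "Genealogía"
garbled  = misdecode_as_latin1(original)

puts original.length                       # 10 characters
puts garbled.length                        # 11: the two bytes of "í" became two characters
puts repair_mojibake(garbled) == original  # true
```

The repair step only works while every garbled character still fits in ISO-8859-1; otherwise encode raises Encoding::UndefinedConversionError.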
    
  • 2020-11-30 02:31

    You need to convert the response from the website being scraped (here epicurious.com) to UTF-8.

    According to the HTML of the page being scraped, its encoding is currently ISO-8859-1, so you need to do something like this:

    require 'open-uri'
    require 'nokogiri'
    require 'iconv'
    doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))
    

    Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping
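
Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so on newer Rubies the same conversion is usually written with String#encode. What follows is a minimal sketch, assuming the fetched bytes really are ISO-8859-1. The helper name latin1_to_utf8 is hypothetical, and the replace options only approximate Iconv's //IGNORE behaviour.

```ruby
# Hypothetical helper: relabel raw bytes as ISO-8859-1, then transcode to UTF-8.
# Characters that cannot be converted are dropped (replaced with an empty string).
def latin1_to_utf8(raw_bytes)
  raw_bytes.dup
           .force_encoding(Encoding::ISO_8859_1)
           .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')
end

# Stand-in for the body returned by open(link).read: 0xF9 is "ù" in ISO-8859-1.
page_bytes = "<title>Rag\xF9</title>".b

puts latin1_to_utf8(page_bytes)
```

The resulting string could then be handed to Nokogiri::HTML in place of the Iconv.conv(...) result.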

  • 2020-11-30 02:44

    When you say "looks like this," are you viewing this value in IRB? IRB escapes non-ASCII characters using C-style escapes of the byte sequences that represent them.

    If you print them with puts, you'll get them back as you expect, provided your shell console uses the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a file handle should also result in UTF-8 sequences.
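
The difference between the escaped view and what puts prints can be seen without fetching anything. This is a small sketch, assuming a UTF-8 source file and a UTF-8 terminal; the helper name escaped_view is hypothetical. Ruby 1.9 and later show hex escapes (\xC3\xB9) where Ruby 1.8 showed octal ones (\303\271), but the bytes are the same.

```ruby
# IRB displays values via #inspect. For a string with no text encoding
# (binary), every non-ASCII byte is shown as an escape sequence.
def escaped_view(text)
  text.b.inspect
end

word = "Ragù"

puts escaped_view(word)          # the two bytes of "ù" appear as \xC3\xB9
puts word                        # writes the raw bytes; a UTF-8 terminal renders "ù"
puts word.bytes.last(2).inspect  # [195, 185], i.e. octal 303 and 271
```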

    If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

    For 1.9, see http://blog.grayproductions.net/articles/ruby_19s_string; for 1.8, you probably need to look at Iconv.

    Also, if you need to interact with COM components on Windows, you'll need to tell Ruby to use the correct encoding with something like the following:

    require 'win32ole'
    
    WIN32OLE.codepage = WIN32OLE::CP_UTF8
    

    If you're interacting with MySQL, you'll need to set the character set and collation on the table to ones that support the encoding you're working with. In general, it's best to use a UTF-8 character set and collation, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.

    Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.
