Nokogiri, open-uri, and Unicode Characters

故里飘歌 2020-11-30 01:56

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but I'm having trouble with accented characters. What's the best way to deal with these?

8 Answers
  • 2020-11-30 02:44

    Tip: you could also use the Scrapifier gem to get metadata, such as the page title, from URIs in a very simple way. The data it returns is all encoded in UTF-8.

    Check it out: https://github.com/tiagopog/scrapifier

    Hope it's useful for you.

  • 2020-11-30 02:45

    Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special characters, specifically em-dashes.

    (The accented characters in your link came through fine with both, so I don't know whether this will help in your case.)

    EXAMPLE:

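    require 'nokogiri'
    require 'open-uri'

    url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

    # URI.open is used because a bare open(url) no longer accepts URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url))
    doc.title
    # => "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

    # Same page, parsed with the HTML5 parser instead
    doc = Nokogiri::HTML5(URI.open(url))
    doc.title
    # => "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"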
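
The garbled title in the example above looks like what you get when the UTF-8 bytes of an em-dash (E2 80 94) are read as ISO-8859-1, one character per byte. Here is a minimal sketch of that effect and of reversing it, assuming that really is the cause. The helper names are made up for illustration, and only Ruby's built-in String encoding methods are used:

```ruby
# Hypothetical helpers for illustration; not part of Nokogiri or open-uri.

# Reproduce the problem: relabel the UTF-8 bytes as ISO-8859-1, then
# transcode to UTF-8, so every original byte becomes its own character.
def misread_as_latin1(text)
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Reverse it: map each character back to a single byte, then relabel the
# bytes as UTF-8. Raises Encoding::UndefinedConversionError if the text
# contains characters above U+00FF, i.e. it was not garbled this way.
def repair_mojibake(text)
  text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

garbled = misread_as_latin1("Josh Waitzkin \u2014 How to Cram")
puts garbled.inspect
puts repair_mojibake(garbled)
```

Repairing after the fact is a workaround. If the page is known to be UTF-8, giving the parser the right encoding up front is usually the cleaner fix.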
    