Nokogiri, open-uri, and Unicode Characters

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but I'm having trouble with accented characters. What's the best way to deal with these?

8 Answers

夕颜

2020-11-30 02:44

Tip: you could also use the Scrapifier gem to get metadata, such as the page title, from URIs in a very simple way. The data are all encoded in UTF-8.

Check it out: https://github.com/tiagopog/scrapifier

Hope it's useful for you.

长情又很酷

2020-11-30 02:45

Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special characters, specifically em-dashes.

(The accented characters in your link came through fine in both, so I don't know whether this would help you with that.)

EXAMPLE:

require 'nokogiri'
require 'open-uri'

url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

# HTML4 parser: the em-dash in the title comes back garbled
doc = Nokogiri::HTML(URI.open(url))
doc.title
# => "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

# HTML5 parser: the em-dash is preserved
doc = Nokogiri::HTML5(URI.open(url))
doc.title
# => "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
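
The garbled sequence in the first title is consistent with the em-dash's UTF-8 bytes being read as ISO-8859-1. That the HTML4 parser fell back to ISO-8859-1 for this page is an assumption, not something the output above proves. A minimal sketch in plain Ruby (no Nokogiri, no network; the variable names are only illustrative) reproduces the same characters:

```ruby
# Sketch: reproduce the garbled em-dash with plain Ruby strings.
em_dash = "\u2014"            # the em-dash, stored as UTF-8
utf8_bytes = em_dash.bytes    # its three UTF-8 bytes: E2 80 94

# Relabel the same bytes as ISO-8859-1 (the assumed fallback encoding),
# then transcode to UTF-8 so each byte becomes its own character.
misread = em_dash.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Reverse the mistake: recover the original bytes and relabel them as UTF-8.
repaired = misread.encode("ISO-8859-1").force_encoding("UTF-8")

puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts misread.inspect
puts repaired == em_dash
```

Here misread is the three-character string "â\u0080\u0094" seen in the first title, and repaired is the original em-dash again. If the same round trip fixes a title you scraped, the page was most likely UTF-8 that got parsed under the wrong encoding.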
