HTML is read before fully loaded using open-uri and nokogiri

Submitted by 纵饮孤独 on 2019-12-12 08:47:44

Question


I'm using open-uri and nokogiri with Ruby to do some simple web crawling. The problem is that the HTML is sometimes read before the page has fully loaded. In those cases, I cannot fetch any content other than the loading icon and the nav bar. What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?

Currently my script looks like:

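require 'nokogiri'
require 'open-uri'
require 'openssl'

url = "https://www.the-page-i-wanna-crawl.com"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
# VERIFY_NONE disables certificate verification and is unsafe outside of testing.
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text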

Answer 1:


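What you describe is not possible. The result of open is passed to Nokogiri::HTML only after open has returned the complete response, so Nokogiri never sees a partially downloaded page.

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. open-uri only downloads the initial HTML and does not execute JavaScript, so anything the page loads afterwards is missing from the response. In this case you can use Watir to fetch the page through a real browser: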

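require 'nokogiri'
require 'watir'

# Drive a real browser so that the page's JavaScript runs.
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

# Parse the DOM as rendered by the browser, then shut the browser down.
doc = Nokogiri::HTML.parse(browser.html)
browser.close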

Note that this may open a visible browser window.
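
Even with a real browser, AJAX content can arrive some time after goto returns, so it may be necessary to poll until the element you need is present before reading the HTML. The following is only a minimal sketch of that idea: wait_for and WaitTimeout are hypothetical names invented for illustration and are not part of Watir or Nokogiri. The demonstration uses a counter in place of a real page so that it runs on its own.

```ruby
# Hypothetical error class and helper, written for this example only.
class WaitTimeout < StandardError; end

# Calls the block repeatedly until it returns a truthy value, then returns
# that value. Raises WaitTimeout if the timeout (in seconds) expires first.
def wait_for(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a page whose content shows up late: the block only becomes
# truthy on the third call, much like an element that appears after an
# AJAX request has completed.
attempts = 0
value = wait_for(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? "loaded" : nil
end
puts value
```

With the Watir session from the answer, the block could be something like browser.h2.exists? && browser.html, which would return the rendered HTML once an h2 is present. Watir also ships its own waiting methods, so it is worth checking its documentation before relying on a hand-written loop.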



Source: https://stackoverflow.com/questions/13789583/html-is-read-before-fully-loaded-using-open-uri-and-nokogiri
