I'm using open-uri and nokogiri with Ruby to do some simple web crawling.
One problem is that the HTML is sometimes read before the page has fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar.
What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?
Currently my script looks like:
require 'nokogiri'
require 'open-uri'
url = "https://www.the-page-i-wanna-crawl.com"
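# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
# Note: VERIFY_NONE disables TLS certificate verification.
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))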
puts doc.at_css("h2").text
What you describe is not possible. The result of open is passed to HTML only after the open method has returned the full response.
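To illustrate the point, here is a minimal sketch that serves a made-up page from a throwaway local server and fetches it with open-uri. The page content, the PAGE constant and the fetch_static_page helper are all hypothetical and exist only for this demonstration. The body comes back complete, but it is exactly what the server sent: the script is still plain text and the placeholder has not been replaced, because neither open-uri nor Nokogiri executes JavaScript.

```ruby
require 'open-uri'
require 'socket'

# Hypothetical page: the static markup holds only a placeholder.
# In a real browser, the script would replace it with the actual content.
PAGE = <<~HTML
  <html><body>
    <nav>Menu</nav>
    <div id="content">Loading...</div>
    <script>
      document.getElementById("content").innerHTML = "<h2>Real title</h2>";
    </script>
  </body></html>
HTML

# Serve PAGE once from a local socket and fetch it with open-uri.
def fetch_static_page
  server = TCPServer.new('127.0.0.1', 0) # port 0 picks any free port
  port = server.addr[1]

  responder = Thread.new do
    client = server.accept
    # Consume the request headers up to the blank line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write "HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/html\r\n" \
                 "Content-Length: #{PAGE.bytesize}\r\n" \
                 "Connection: close\r\n\r\n" \
                 "#{PAGE}"
    client.close
  end

  # proxy: false keeps any configured HTTP proxy out of the local request.
  body = URI.open("http://127.0.0.1:#{port}/", proxy: false).read
  responder.join
  body
ensure
  server&.close
end

body = fetch_static_page
puts body.include?('Loading...')            # placeholder is still there
puts body.include?('<div id="content"><h2') # script was never executed
```

Parsing this body with Nokogiri would therefore find the nav bar and the placeholder, but no h2 element, which matches the symptom described in the question.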
I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In this case, you can use Watir to fetch the page using a real browser:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
doc = Nokogiri::HTML.parse(browser.html)
This might open a browser window though.
Source: https://stackoverflow.com/questions/13789583/html-is-read-before-fully-loaded-using-open-uri-and-nokogiri