Question
I am trying to scrape the five latest stories from CNN.com and retrieve their links along with the first paragraph of each story. I have this simple script:
url = "http://edition.cnn.com/?refresh=1"
agent = Mechanize.new
agent.get("http://edition.cnn.com/?refresh=1").search("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline|
  article = headline.text
  link = URI.join(url, headline[:href]).to_s
  page = headline.click(link)
  paragraph1 = page.at_css(".adtag15090+ p").text
  puts "#{article}"
  puts "#{link}"
  puts "#{paragraph1}"
  puts "\n"
end
This code does not work because the click method is not recognized. It raises this error:
cnn_scraper.rb:10:in `block in <main>': undefined method `click' for #<Nokogiri::XML::Element:0x2c49b40> (NoMethodError)
The first paragraph of every article on CNN.com matches the selector .adtag15090+ p. Also notice that the script parses all articles, whereas I want only five. Any ideas on how to get the first five stories and their first paragraphs using Nokogiri and Mechanize?
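For reference, here is a minimal sketch of one direction that might address both points. It rests on two observations: the nodes returned by search are plain Nokogiri elements, which have no click method, so each article would need to be fetched through the agent itself; and the node set is enumerable, so first(5) can cap the number of headlines. The helper name latest_stories and its parameters are hypothetical (they are not part of the original script), and the selectors are taken from the question as-is without checking them against the live site.

```ruby
require 'uri'

# Hypothetical helper, not part of the original script.
# `agent` is assumed to be a Mechanize instance; more generally, any object
# whose #get returns a page responding to #search and #at will work.
def latest_stories(agent, url, headline_xpath, paragraph_css, limit: 5)
  front_page = agent.get(url)

  # #search returns an enumerable set of nodes, so #first(limit)
  # keeps only the first `limit` headlines instead of all of them.
  front_page.search(headline_xpath).first(limit).map do |headline|
    # Resolve relative hrefs against the front-page URL.
    link = URI.join(url, headline[:href]).to_s

    # Nokogiri elements cannot be clicked; fetch the article via the agent.
    article_page = agent.get(link)
    paragraph = article_page.at(paragraph_css)

    {
      title: headline.text.strip,
      link: link,
      # nil when the selector matches nothing on the article page
      paragraph: paragraph && paragraph.text.strip
    }
  end
end
```

With the mechanize gem loaded, this could presumably be called as latest_stories(Mechanize.new, url, "//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a", ".adtag15090+ p"), printing each hash's :title, :link, and :paragraph in turn. Returning nil for a missing paragraph, rather than calling text on it directly, avoids a NoMethodError on articles whose layout differs.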
Source: https://stackoverflow.com/questions/22055544/getting-visiting-and-limiting-the-number-of-links-using-nokogiri-and-mechanize