Question
What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (to find the names and prices) to Nokogiri as method arguments.
Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this the completely wrong way?
Here is the code:
require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI
@@collection = Array.new # Array to hold meta hash
def scrape(url, vendor, item_path, name_path, price_path)
  doc = Nokogiri::HTML(open(url)) # Opens URL
  items = doc.css(item_path) # Sets items
  items.each do |item| # Iterates through each item on grid
    @@collection << meta = Hash.new # Creates a new hash, then adds it to the global array
    meta[:vendor] = vendor
    meta[:name] = item.css(name_path).text.strip
    meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
  end
end
scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price")
Answer 1:
You can pass multiple URLs the same way you're already doing it in your second example:
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price")
Your scrape method will then have to iterate over those URLs, for instance:
def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      @@collection << meta = Hash.new # Creates a new hash, then adds it to the global array
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
    end
  end
end
This also means the URL in the first example now needs to be passed as an array as well:
scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
Answer 2:
FYI, using a class variable like @@collection is inappropriate here. Instead, write your method to return a value:
def scrape(urls, vendor, item_path, name_path, price_path)
  collection = []
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      collection << {
        :vendor => vendor,
        :name => item.css(name_path).text.strip,
        :price => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    end
  end
  collection
end
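The caller can then merge the results from several vendors itself. For example (using the sample pages from the question; the exact hash contents depend on the scraped markup):

vendor_a = scrape(["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
vendor_b = scrape(["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price")
collection = vendor_a + vendor_b
# => [{:vendor => "Sample Vendor A", :name => "...", :price => "..."}, ...]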
Which can be reduced to:
def scrape(urls, vendor, item_path, name_path, price_path)
  urls.map { |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.map { |item| # Iterates through each item on grid
      {
        :vendor => vendor,
        :name => item.css(name_path).text.strip,
        :price => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    }
  }
end
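One thing to note (my observation, not part of the original answer): because this version maps over both the URLs and the items, it returns an array of arrays, one inner array per URL, rather than the flat list the previous version built. Flattening the result, or using flat_map for the outer loop, restores the flat list of hashes:

scrape(["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price").flatten
# or, inside the method, replace the outer urls.map with urls.flat_map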
Source: https://stackoverflow.com/questions/15453115/iterating-through-multiple-urls-to-parse-html-with-nokogori