I am using Net::HTTP for HTTP requests and getting a response back:
uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new uri.request_uri
response = http.request request # Net::HTTPResponse object
body = response.body
If I want to parse this HTML response with the Nokogiri gem, I do:
nokogiri_obj = Nokogiri::HTML(body)
But if I want to use the Mechanize gem, I need to do this:
agent = Mechanize.new
mechanize_obj = agent.get("http://www.example.com")
Is it possible to use Net::HTTP to get the HTML response and then use the Mechanize gem to convert it into a Mechanize object, instead of using agent.get()?
EDIT:
The reason for getting around the agent.get() method is that I am trying to use EventMachine::Iterator to make concurrent EM-HTTP requests.
EventMachine.run do
  EM::Iterator.new(urls, 3).each do |url, iter|
    puts "giving #{url} to httprequest now"
    http = EM::HttpRequest.new(url).get
    http.callback { |resp|
      uri = resp.send(:URI, url)
      puts "inside callback of #{url}"
      body = resp.response
      page = agent.parse(uri, resp, body)
    }
    iter.next
  end
end
But it's not working. I am getting an error:
/usr/local/rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:1165:in`parse': undefined method `[]' for #<EventMachine::HttpClient:0x0000001c18eb30> (NoMethodError)
When I use the parse method with Net::HTTP, it works fine and I get the Mechanize object:
uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new uri.request_uri
response = http.request request # Net::HTTPResponse object
body = response.body
agent = Mechanize.new
page = agent.parse(uri, response, body)
Am I passing the wrong arguments to the parse method when using em-http?
I'm not sure why you think using Net::HTTP would be better. Mechanize handles redirects and cookies, and it gives ready access to Nokogiri's parsed document.
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.example.com')
# Use Nokogiri to find the content of the <h1> tag...
puts page.at('h1').content # => "Example Domain"
Note that setting the user_agent isn't necessary to reach example.com.
If you want to use a threaded engine to retrieve pages, take a look at Typhoeus and its Hydra.
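If pulling in another gem isn't an option, the same "a few requests at a time" idea can be sketched with plain Ruby threads. This is only a rough illustration, not how Typhoeus works internally: fetch_all is a made-up helper name, and the block you pass it is whatever does the actual fetching. It caps the number of worker threads at 3 by default, like the EM::Iterator in the question.

```ruby
require 'thread'

# Hypothetical helper (not part of Mechanize or Typhoeus): runs the given
# block once per URL using at most `concurrency` worker threads and returns
# a hash of url => result.
def fetch_all(urls, concurrency = 3, &fetch)
  queue = Queue.new
  urls.each { |url| queue << url }

  results = {}
  lock = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break           # nothing left to do, end this worker
        end
        result = fetch.call(url)
        lock.synchronize { results[url] = result }
      end
    end
  end

  workers.each(&:join)
  results
end
```

With Mechanize you would call it along the lines of fetch_all(urls) { |url| Mechanize.new.get(url) }. Creating one agent per call is an assumption on my part to avoid sharing an agent between threads; it also means cookies are not shared between requests.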
Looks like Mechanize has a parse method on its agent instances, so this could work:
mechanize_obj = Mechanize.new.parse(uri, response, body)
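As for the NoMethodError with em-http: the error message says parse calls [] on the response object, which a Net::HTTPResponse supports and an EventMachine::HttpClient does not. One possible workaround is to hand parse a small wrapper that looks enough like a Net::HTTP response. The sketch below is untested against Mechanize itself. ResponseAdapter is a made-up name, and I'm assuming parse only needs [], each and code, and that em-http exposes headers as a hash with keys like CONTENT_TYPE.

```ruby
# Hypothetical adapter: exposes the small slice of the Net::HTTPResponse
# interface ([], each, code) that Mechanize#parse is assumed to rely on.
class ResponseAdapter
  attr_reader :code

  # headers: any hash-like object of header name => value
  # status:  the HTTP status code (integer or string)
  def initialize(headers, status)
    @headers = {}
    headers.each { |name, value| @headers[normalize(name)] = value }
    @code = status.to_s
  end

  # Case-insensitive lookup, so 'Content-Type' finds 'CONTENT_TYPE'.
  def [](name)
    @headers[normalize(name)]
  end

  # Yields each normalized header name with its value.
  def each(&block)
    @headers.each(&block)
  end

  private

  # 'CONTENT_TYPE' and 'Content-Type' both become 'content-type'.
  def normalize(name)
    name.to_s.downcase.tr('_', '-')
  end
end

adapter = ResponseAdapter.new({ 'CONTENT_TYPE' => 'text/html; charset=utf-8' }, 200)
puts adapter['Content-Type']
```

Inside the callback you would then try something like agent.parse(URI(url), ResponseAdapter.new(resp.response_header, resp.response_header.status), resp.response), assuming response_header behaves as described above.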
Source: https://stackoverflow.com/questions/12047100/ruby-mechanize-nokogiri-and-nethttp