Parse html GET via open() with nokogiri - redirect exception

Submitted by 此生再无相见时 on 2019-12-25 06:39:05

Question


I'm trying to learn Ruby, so I'm working through an exercise from Google's developer site. I'm trying to parse some links. When a URL redirects (I know it can only be redirected once), I get a "redirection forbidden" error. I noticed that the redirect goes from an http link to an https link. Does anyone have a concrete idea of how I could implement this in Ruby, given that Google's exercise is written for Python?

error:

ruby fix.rb
redirection forbidden: http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg -> https://developers.google.com/edu/python/images/puzzle/p-bija-baei.jpg?csw=1

code that should achieve what I'm looking for:

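require 'nokogiri'
require 'open-uri'

# urls: list of valid, already-checked URLs.
# imgs: list of the images I'll download afterwards.
def acquireData(urls, imgs)
  urls.each do |url|
    # URI.open replaces the bare open(url), which no longer opens URLs in Ruby 3.0+.
    page = Nokogiri::HTML(URI.open(url))
    puts page.to_html
  end
rescue StandardError => e
  puts e
end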

Answer 1:


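Ruby's OpenURI will automatically handle redirects for you, as long as they're not "meta-refresh" redirects that occur inside the HTML itself and the redirect is one OpenURI permits. The http to https redirect in the question is refused, which is what the "redirection forbidden" error reports.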

For instance, this follows a redirect automatically:

irb(main):008:0> page = open('http://www.example.org')
#<StringIO:0x00000002ae2de0>
irb(main):009:0> page.base_uri.to_s
"http://www.iana.org/domains/example"

In other words, the request to "www.example.org" got redirected to "www.iana.org" and OpenURI tracked it correctly.

If you are trying to learn HOW to handle redirects, read the Net::HTTP documentation. Here is the example from that documentation:

Following Redirection

Each Net::HTTPResponse object belongs to a class for its response code.

For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.

Using a case statement you can handle various types of responses properly:

require 'net/http'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))

  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end
end

print fetch('http://www.ruby-lang.org')
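One gap worth noting in the example above: a Location header may hold a relative reference rather than a full URL, and passing that straight back into fetch would leave Net::HTTP without a host. A minimal sketch of one way to handle it, assuming a hypothetical helper named resolve_location (it is not part of Net::HTTP) built on the standard library's URI.join:

```ruby
require 'uri'

# Hypothetical helper (not part of Net::HTTP): turn a Location header value
# into an absolute URL by resolving it against the URL that was just requested.
# Absolute values, including ones that switch from http to https, pass through.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

# Inside fetch, the recursive call could then read (sketch only):
#   fetch(resolve_location(uri_str, location), limit - 1)
```

Because fetch builds a fresh request for every hop, it is not subject to OpenURI's scheme restriction, so the http to https redirect from the question should be followed as well.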

If you want to handle meta-refresh statements, consider this:

require 'nokogiri'

doc = Nokogiri::HTML(%[<meta http-equiv="refresh" content="5;URL='http://example.com/'">])
meta_refresh = doc.at('meta[http-equiv="refresh"]')
if meta_refresh
  puts meta_refresh['content'][/URL=(.+)/, 1].gsub(/['"]/, '')
end

Which outputs:

http://example.com/
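The regular expression above expects an uppercase URL= and discards the delay. As a rough sketch of a slightly more tolerant version, the hypothetical helper below (parse_meta_refresh is a made-up name, not a Nokogiri method) takes the content attribute string, for instance meta_refresh['content'], and returns the delay together with the target. It is plain Ruby and makes no claim to cover every form browsers accept:

```ruby
# Hypothetical helper: split a meta-refresh content value such as
# "5;URL='http://example.com/'" into [delay_in_seconds, url_or_nil].
def parse_meta_refresh(content)
  delay, rest = content.to_s.split(';', 2)
  # Accept "URL=" in any letter case, with optional spaces around "=".
  url = rest.to_s.strip[/\Aurl\s*=\s*(.*)\z/i, 1]
  # Drop one surrounding quote character on each side, if present.
  url = url.strip.gsub(/\A['"]|['"]\z/, '') if url
  url = nil if url && url.empty?
  [delay.to_s.strip.to_i, url]
end

delay, target = parse_meta_refresh(%[5;URL='http://example.com/'])
puts "wait #{delay}s, then go to #{target}"
```

A relative target would still need to be resolved against the page's own URL before it can be requested.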



Answer 2:


Basically, the code.google.com URL that you're trying to open redirects to an https URL. You can see this for yourself by pasting http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg into your browser.

A bug report linked from the original answer explains why open-uri can't redirect to https.

So the solution to your problem is simply to use a different set of URLs (ones that don't redirect to https).
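If switching URLs is not an option, another approach is to turn off open-uri's automatic redirect handling and follow the redirect by hand. The sketch below relies on open-uri's redirect: false option, which raises OpenURI::HTTPRedirect carrying the target in its uri attribute. The helper name open_following_redirects and its limit and opener parameters are assumptions made for illustration; opener exists only so the loop can be exercised without a network connection.

```ruby
require 'open-uri'

# Hypothetical helper (not part of open-uri): follow redirects manually so
# that an http -> https hop no longer ends in "redirection forbidden".
def open_following_redirects(url, limit: 5,
                             opener: ->(uri) { URI.open(uri, redirect: false) })
  uri = URI.parse(url)
  limit.times do
    begin
      return opener.call(uri)
    rescue OpenURI::HTTPRedirect => e
      # Keep refusing the downgrade direction, which open-uri blocks on purpose.
      if uri.scheme == 'https' && e.uri.scheme == 'http'
        raise "refusing https -> http redirect: #{uri} -> #{e.uri}"
      end
      uri = e.uri # absolute target taken from the Location header
    end
  end
  raise ArgumentError, "too many HTTP redirects (last target: #{uri})"
end
```

In the question's loop, open_following_redirects(url) could stand in for open(url). This assumes a Ruby recent enough to provide URI.open.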



Source: https://stackoverflow.com/questions/15118974/parse-html-get-via-open-with-nokogiri-redirect-exception
