Using Watir to check for bad links

清歌不尽 2021-01-13 02:32

I have an unordered list of links that I save off to the side, and I want to click each link and make sure it goes to a real page and doesn't 404, 500, etc.

The issue is that I don't know how to do it.

4 Answers
  •  悲&欢浪女
    2021-01-13 03:02

    There's no need to use Watir for this. An HTTP HEAD request will tell you whether the URL resolves, and it will be faster.

    Ruby's Net::HTTP can do it, or you can use OpenURI.

    Using OpenURI you can request a URI and get the page back. Because you don't really care what the page contains, you can throw that part away and just check whether you got something:

    require 'open-uri'

    # Read the page; if any content came back, the URL resolved.
    # Note: a 404 or 500 raises OpenURI::HTTPError instead of reaching the else branch.
    if !URI.open('http://www.example.com').read.empty?
      puts "is"
    else
      puts "isn't"
    end
    

    The upside is that OpenURI resolves HTTP redirects for you. The downside is that it returns full pages, so it can be slow.

    Ruby's Net::HTTP can help somewhat, because it can use HTTP HEAD requests, which don't return the entire page, only the headers. That by itself isn't enough to know whether the actual page is reachable, because the HEAD response could redirect to a page that doesn't resolve, so you have to follow the redirects until you either stop getting one or you get an error. The Net::HTTP docs have an example to get you started:

    require 'net/http'
    require 'uri'
    
    def fetch(uri_str, limit = 10)
      # You should choose a better exception class.
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0
    
      response = Net::HTTP.get_response(URI.parse(uri_str))
      case response
      when Net::HTTPSuccess     then response
      when Net::HTTPRedirection then fetch(response['location'], limit - 1)
      else
        response.error!
      end
    end
    
    print fetch('http://www.ruby-lang.org')
    

    Again, that example returns whole pages, which might slow you down. You can replace get_response with request_head, which returns a response just like get_response does but without the body, and that should help.
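
    As a rough, untested sketch of that substitution, the same redirect-following loop can issue HEAD requests instead; note that request_head is an instance method, so the connection is opened explicitly:

    require 'net/http'
    require 'uri'

    # Same redirect loop as above, but sending HEAD requests so only the
    # headers come back, never the page body.
    def fetch_head(uri_str, limit = 10)
      raise ArgumentError, 'HTTP redirect too deep' if limit == 0

      uri = URI.parse(uri_str)
      response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.request_head(uri.request_uri)
      end

      case response
      when Net::HTTPSuccess     then response
      when Net::HTTPRedirection then fetch_head(response['location'], limit - 1)
      else
        response.error!
      end
    end

    puts fetch_head('http://www.ruby-lang.org').code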

    In either case, there's another thing you have to consider. A lot of sites use "meta refreshes", which cause the browser to reload the page from an alternate URL after parsing it. Handling these requires requesting the page and parsing it, looking for the <meta http-equiv="refresh"> tags.
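
    For example, here's a minimal, untested sketch of that check using the Nokogiri gem (the helper name is made up; any HTML parser would do):

    require 'open-uri'
    require 'nokogiri'

    # Returns the target URL of a <meta http-equiv="refresh"> tag, or nil if
    # the page doesn't use one.
    def meta_refresh_target(url)
      doc = Nokogiri::HTML(URI.open(url).read)
      meta = doc.at('meta[http-equiv="refresh"]')
      return nil unless meta

      # The content attribute looks like "5; url=http://www.example.com/other".
      meta['content'][/url=(.+)/i, 1]
    end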

    Other HTTP gems, such as Typhoeus and Patron, can also do HEAD requests easily, so take a look at them too. In particular, Typhoeus can handle heavy loads via its companion Hydra, which makes it easy to run requests in parallel.


    EDIT:

    require 'typhoeus'

    response = Typhoeus::Request.head("http://www.example.com")
    response.code # => 302

    case response.code
    when (200 .. 299)
      # Success -- the link resolved.
    when (300 .. 399)
      # Redirect. With this (older) Typhoeus API the headers come back as a raw
      # string, so split it into a hash to pull out the Location header.
      headers = Hash[*response.headers.split(/[\r\n]+/).map{ |h| h.split(' ', 2) }.flatten]
      puts "Redirected to: #{ headers['Location:'] }"
    when (400 .. 499)
      # Client error -- the link is bad (404, etc.).
    when (500 .. 599)
      # Server error.
    end
    # >> Redirected to: http://www.iana.org/domains/example/
    

    Just in case you haven't played with one, here's what the response looks like. It's useful for exactly the sort of situation you're looking at:

    (rdb:1) pp response
    #<Typhoeus::Response
        :method => :head,
        :url => http://www.example.com,
        :headers => {"User-Agent"=>"Typhoeus - http://github.com/dbalatero/typhoeus/tree/master"},
     @requested_http_method=nil,
     @requested_url=nil,
     @start_time=nil,
     @start_transfer_time=0.109741,
     @status_message=nil,
     @time=0.109822>
    

    If you have a lot of URLs to check, see the Hydra example that is part of Typhoeus.
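
    As a starting point, here's a minimal, untested sketch of queuing HEAD requests through Hydra so they run in parallel (the URL list and concurrency limit are placeholders, and it assumes a current Typhoeus release):

    require 'typhoeus'

    urls = ['http://www.example.com', 'http://www.ruby-lang.org'] # placeholder list

    hydra = Typhoeus::Hydra.new(max_concurrency: 10)

    urls.each do |url|
      request = Typhoeus::Request.new(url, method: :head, followlocation: true)
      request.on_complete do |response|
        # Report anything that didn't come back as a success after redirects.
        puts "#{url} => #{response.code}" unless response.success?
      end
      hydra.queue(request)
    end

    hydra.run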
