How do I check for valid (not dead) links programmatically using PHP?

独厮守ぢ 2020-12-08 08:24

Given a list of urls, I would like to check that each url:

  • Returns a 200 OK status code
  • Returns a response within X amount of time

The

9 answers
  • 2020-12-08 08:47

    One problem you will undoubtedly run into is the box this script runs on losing access to the Internet: every URL then fails at once, and you'll get 1000 false positives.

    It would probably be better for your script to keep some kind of history and report a link only after it has been failing for 5 days.

    The script should also check itself in some way before continuing with the standard checks, for example by first requesting a known-good web site (Google?).
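
    A minimal sketch of both ideas, not a finished implementation. All names here (`record_check`, `should_report`, `run_checks`, the `$canaryUrl` parameter) are made up for illustration, and `$isAvailable` stands for whatever single-URL check you already have. The history is kept as a plain array mapping each failing URL to the time it first failed; persisting it between runs is left out.

```php
<?php
// Hypothetical helper: remember when a URL started failing, forget it on success.
// $history maps URL => Unix timestamp of the first failure in the current streak.
function record_check(array $history, string $url, bool $ok, int $now): array
{
    if ($ok) {
        unset($history[$url]);        // any success clears the record
    } elseif (!isset($history[$url])) {
        $history[$url] = $now;        // first failure of a new streak
    }
    return $history;
}

// Report a URL only once it has been failing for at least $graceDays days.
function should_report(array $history, string $url, int $now, int $graceDays = 5): bool
{
    return isset($history[$url])
        && ($now - $history[$url]) >= $graceDays * 86400;
}

// Run all checks, but skip the whole run if the known-good "canary" URL fails,
// because that suggests our own connection is down rather than the links.
function run_checks(array $urls, array $history, callable $isAvailable, string $canaryUrl, int $now): array
{
    if (!$isAvailable($canaryUrl)) {
        return $history;              // leave the history untouched
    }
    foreach ($urls as $url) {
        $history = record_check($history, $url, (bool) $isAvailable($url), $now);
    }
    return $history;
}
```

    Passing the checker in as a callable keeps the bookkeeping independent of cURL, so it can be exercised with a fake checker.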

  • 2020-12-08 08:48

    Seems like it might be a job for curl.

    If you're not tied to PHP, Perl's LWP might be an answer too.

  • 2020-12-08 08:53

    Just returning a 200 response is not enough; many once-valid links keep returning 200 after the former owner fails to renew the domain and it turns into a porn or gambling portal.

    Domain squatters typically ensure that every URL in their domains returns 200.
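
    One way to catch this, sketched here as a rough heuristic rather than a reliable detector, is to record something about the page when the link is first verified and compare it later. The example below uses the `<title>` element; the function names `extract_title` and `title_changed` are hypothetical, and a legitimately retitled page would trigger it as well, so a change is a reason for manual review and not proof of squatting.

```php
<?php
// Hypothetical heuristic: pull the text of the first <title> element out of HTML.
function extract_title(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null; // no title element found
}

// True when a title was recorded earlier and the page no longer carries it.
function title_changed(?string $recordedTitle, string $currentHtml): bool
{
    return $recordedTitle !== null
        && extract_title($currentHtml) !== $recordedTitle;
}
```

    This needs the page body, so it cannot be combined with a HEAD-only request.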

  • 2020-12-08 08:58

    You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
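
    To make that distinction explicit, the status code can be mapped to a coarse state instead of a plain yes/no. This is only a sketch: `classify_status` and its labels are invented for illustration, and treating code 0 as "no response" assumes the code came from cURL's `CURLINFO_HTTP_CODE`, which is 0 when no HTTP reply was received.

```php
<?php
// Hypothetical helper: map an HTTP status code to a coarse link state.
function classify_status(int $code): string
{
    if ($code === 0) {
        return 'no-response';   // timeout, DNS failure, refused connection, ...
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if (in_array($code, [301, 302, 303, 307, 308], true)) {
        return 'redirect';      // alive, but the redirect target should be checked too
    }
    return 'broken';            // 4xx, 5xx and anything unexpected
}
```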

  • 2020-12-08 09:01

    Use the PHP cURL extension. Unlike a plain fopen() call, it makes it easy to send HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, as you don't have to download the entire body of the page.

    As a starting point you could use some function like this:

    function is_available($url, $timeout = 30) {
        $ch = curl_init(); // get cURL handle
    
        // set cURL options
        $opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
                      CURLOPT_URL => $url,            // set URL
                      CURLOPT_NOBODY => true,         // do a HEAD request only
                      CURLOPT_TIMEOUT => $timeout);   // set timeout
        curl_setopt_array($ch, $opts); 
    
        curl_exec($ch); // do it!
    
        $retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK
    
        curl_close($ch); // close handle
    
        return $retval;
    }
    

    However, there are a ton of possible optimizations: you might want to re-use the cURL handle and, if checking more than one URL per host, even re-use the connection.

    Oh, and this code checks strictly for HTTP response code 200. It does not follow redirects (301, 302), but there is a cURL option for that: CURLOPT_FOLLOWLOCATION.
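
    As a rough sketch of those two refinements, the variant below re-uses one cURL handle for a whole list of URLs and follows redirects. The function name `check_urls`, the `$maxRedirects` parameter and the shape of the result array are illustrative choices, not part of the answer above; the cURL constants themselves are standard.

```php
<?php
// Sketch: check several URLs with one re-used cURL handle, following redirects.
// Returns URL => ['ok' => bool, 'code' => int, 'final_url' => string, 'seconds' => float].
function check_urls(array $urls, int $timeout = 30, int $maxRedirects = 5): array
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,          // do not output to browser
        CURLOPT_NOBODY         => true,          // HEAD request only
        CURLOPT_FOLLOWLOCATION => true,          // follow 301/302 responses
        CURLOPT_MAXREDIRS      => $maxRedirects, // give up on redirect loops
        CURLOPT_CONNECTTIMEOUT => $timeout,      // limit for connecting, in seconds
        CURLOPT_TIMEOUT        => $timeout,      // limit for the whole request
    ));

    $results = array();
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);     // same handle, new target
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response arrived
        $results[$url] = array(
            'ok'        => curl_errno($ch) === 0 && $code === 200,
            'code'      => $code,
            'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
            'seconds'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        );
    }
    curl_close($ch);
    return $results;
}
```

    With redirects followed, `code` is the status of the final response, so a 301 that lands on a working page counts as available. The `seconds` value can be compared against your own response-time limit.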

  • 2020-12-08 09:06
    1. fopen() supports HTTP URLs (when allow_url_fopen is enabled).
    2. If you need more flexibility (such as fine-grained timeouts), look into the cURL extension.
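
    For the first option, here is a minimal sketch that stays within PHP's built-in HTTP stream wrapper. It uses `get_headers()` with a stream context rather than `fopen()` directly, which is a substitution on my part; the helper names `status_from_line` and `head_status` are hypothetical, and the approach assumes `allow_url_fopen` is enabled.

```php
<?php
// Parse the numeric status code out of a status line such as "HTTP/1.1 200 OK".
function status_from_line(string $line): ?int
{
    return preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)
        ? (int) $m[1]
        : null;
}

// Send a HEAD request through the HTTP stream wrapper and return the status
// code of the first response, or null if no response could be read.
function head_status(string $url, float $timeout = 10.0): ?int
{
    $context = stream_context_create(array(
        'http' => array(
            'method'          => 'HEAD',
            'timeout'         => $timeout, // read timeout, in seconds
            'follow_location' => 0,        // report redirects instead of following them
        ),
    ));
    $headers = @get_headers($url, false, $context);
    return $headers === false ? null : status_from_line($headers[0]);
}
```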