How can I check if a URL exists via PHP?

天涯浪人 2020-11-22 04:13

How do I check if a URL exists (not 404) in PHP?

22 Answers
  • 2020-11-22 05:01

    karim79's get_headers() solution didn't work for me; I got crazy results with Pinterest:

    get_headers(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
    
    Array
    (
        [url] => https://www.pinterest.com/jonathan_parl/
        [exists] => 
    )
    
    get_headers(): Failed to enable crypto
    
    Array
    (
        [url] => https://www.pinterest.com/jonathan_parl/
        [exists] => 
    )
    
    get_headers(https://www.pinterest.com/jonathan_parl/): failed to open stream: operation failed
    
    Array
    (
        [url] => https://www.pinterest.com/jonathan_parl/
        [exists] => 
    ) 
    

    Anyway, this developer demonstrates that cURL is way faster than get_headers():

    http://php.net/manual/fr/function.get-headers.php#104723

    Since many people asked karim79 to fix his cURL solution, here's the solution I built today.

    /**
    * Send an HTTP request to $url and check the headers sent back.
    *
    * @param $url String URL to which we must send the request.
    * @param $failCodeList Int array list of codes for which the page is considered invalid.
    *
    * @return Boolean
    */
    public static function isUrlExists($url, array $failCodeList = array(404)){
    
        $exists = false;
    
        if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){
    
            $url = "https://" . $url;
        }
    
        if (preg_match(RegularExpression::URL, $url)){
    
            $handle = curl_init($url);
    
    
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    
            curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    
            curl_setopt($handle, CURLOPT_HEADER, true);
    
            curl_setopt($handle, CURLOPT_NOBODY, true);
    
            // CURLOPT_USERAGENT expects a string; any descriptive value works here.
            curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0");
    
    
            $headers = curl_exec($handle);
    
            curl_close($handle);
    
    
            if (empty($failCodeList) or !is_array($failCodeList)){
    
                $failCodeList = array(404); 
            }
    
            if (!empty($headers)){
    
                $exists = true;
    
                $headers = explode(PHP_EOL, $headers);
    
                foreach($failCodeList as $code){
    
                    if (is_numeric($code) and strpos($headers[0], strval($code)) !== false){
    
                        $exists = false;
    
                        break;  
                    }
                }
            }
        }
    
        return $exists;
    }
    

    Let me explain the cURL options:

    CURLOPT_RETURNTRANSFER: return the response as a string instead of printing it to the screen.

    CURLOPT_SSL_VERIFYPEER: when false, cURL won't verify the peer's certificate.

    CURLOPT_HEADER: include the headers in the string.

    CURLOPT_NOBODY: don't include the body in the string.

    CURLOPT_USERAGENT: some sites need it to function properly (for example: https://plus.google.com).
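
    As a rough sketch of how the options listed above fit together, they can be collected in one array and applied with curl_setopt_array(). The names headRequestOptions() and fetchStatusCode(), the timeout value, and the default user-agent string are assumptions made for illustration and are not part of the original answer. Unlike the function above, this sketch keeps certificate verification on unless the caller turns it off.

    ```php
    <?php
    // Sketch only: helper names, timeout, and user-agent string are assumed.

    /** Build the option set for a header-only request. */
    function headRequestOptions(string $userAgent, bool $verifyPeer = true): array
    {
        return [
            CURLOPT_RETURNTRANSFER => true,        // return the response instead of printing it
            CURLOPT_HEADER         => true,        // include the response headers in the string
            CURLOPT_NOBODY         => true,        // send HEAD, so no body is transferred
            CURLOPT_SSL_VERIFYPEER => $verifyPeer, // false skips certificate verification
            CURLOPT_USERAGENT      => $userAgent,  // a string, not a boolean
            CURLOPT_TIMEOUT        => 10,          // assumed value, in seconds
        ];
    }

    /** Return the HTTP status code, or null when the request itself failed. */
    function fetchStatusCode(string $url, string $userAgent = 'url-checker/1.0'): ?int
    {
        $handle = curl_init($url);
        if ($handle === false) {
            return null;
        }
        curl_setopt_array($handle, headRequestOptions($userAgent));
        $ok   = curl_exec($handle) !== false;
        $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

        return $ok ? $code : null;
    }
    ```

    Reading the code through curl_getinfo() avoids splitting the header string by hand; a caller would compare the returned integer against its own list of rejected codes.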


    Additional note: In this function I'm using Diego Perini's regex for validating the URL before sending the request:

    const URL = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu"; //@copyright Diego Perini
    

    Additional note 2: I explode the header string and use headers[0] to be sure to validate only the return code and message (for example: 200, 404, 405, etc.).

    Additional note 3: Sometimes validating only the code 404 is not enough (see the unit test), so there's an optional $failCodeList parameter to supply the full list of codes to reject.
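
    The status-line check described in notes 2 and 3 could also be done by extracting the three-digit code and comparing it exactly, rather than searching the line for a substring. The sketch below is one way to do that; statusCodeFromLine() and passesFailList() are hypothetical names, not part of the function above.

    ```php
    <?php
    // Sketch only: function names are assumed, not part of the original answer.

    /** Extract the three-digit code from a status line such as "HTTP/1.1 404 Not Found". */
    function statusCodeFromLine(string $statusLine): ?int
    {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', trim($statusLine), $m) === 1) {
            return (int) $m[1];
        }
        return null;
    }

    /** True when the status line carries a code that is not in the fail list. */
    function passesFailList(string $statusLine, array $failCodeList = [404]): bool
    {
        $code = statusCodeFromLine($statusLine);

        return $code !== null
            && !in_array($code, array_map('intval', $failCodeList), true);
    }
    ```

    With this shape, passesFailList($headers[0], array(404, 405)) would stand in for the foreach loop over $failCodeList.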

    And, of course, here's the unit test (including all the popular social networks) to validate my code:

    public function testIsUrlExists(){
    
    //invalid
    $this->assertFalse(ToolManager::isUrlExists("woot"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque4545646456"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque890800"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://instagram.com/mariloubiz1232132/", array(404, 405)));
    
    $this->assertFalse(ToolManager::isUrlExists("https://www.pinterest.com/jonathan_parl1231/"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://regex101.com/546465465456"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://twitter.com/arcadefire4566546"));
    
    $this->assertFalse(ToolManager::isUrlExists("https://vimeo.com/**($%?%$", array(400, 405)));
    
    $this->assertFalse(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666456456456"));
    
    
    //valid
    $this->assertTrue(ToolManager::isUrlExists("www.google.ca"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://instagram.com/mariloubiz/"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://www.pinterest.com/"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://regex101.com"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://twitter.com/arcadefire"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://vimeo.com/"));
    
    $this->assertTrue(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666"));
    }
    

    Great success to all,

    Jonathan Parent-Lévesque from Montreal

  • 2020-11-22 05:01

    To check whether a URL is online or offline:

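    function get_http_response_code($theURL) {
        $headers = @get_headers($theURL);
        // get_headers() returns false on failure, so guard before indexing.
        if ($headers === false) {
            return false;
        }
        return substr($headers[0], 9, 3);
    }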
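
    One caveat with reading $headers[0]: get_headers() follows redirects by default and returns the headers of every response in the chain, so the first status line may be a 301 or 302 rather than the final result. A minimal sketch of picking the last status line instead, with finalStatusCode() as an assumed helper name:

    ```php
    <?php
    // Sketch only: finalStatusCode() is an assumed helper name.

    /**
     * Return the status code of the last response in a get_headers() result,
     * or null when no status line is present.
     */
    function finalStatusCode(array $headers): ?int
    {
        $code = null;
        foreach ($headers as $key => $line) {
            // Status lines sit under integer keys; skip named headers and array values.
            if (is_int($key) && is_string($line)
                && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m) === 1) {
                $code = (int) $m[1]; // keep overwriting so the last response wins
            }
        }
        return $code;
    }
    ```

    A caller would first check that get_headers() did not return false, then pass the array to finalStatusCode().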
    
  • 2020-11-22 05:01
    function url_exists($url) {
        $headers = @get_headers($url);
        // get_headers() returns false when the request fails.
        return $headers !== false && strpos($headers[0], '200') !== false;
    }
    
  • 2020-11-22 05:01

    I run some tests to see if links on my site are valid, which alerts me when third parties change their links. I was having an issue with a site whose poorly configured certificate meant that PHP's get_headers didn't work.

    So, I read that cURL was faster and decided to give that a go. Then I had an issue with LinkedIn, which gave me a 999 error that turned out to be a user agent issue.

    I didn't care if the certificate was not valid for this test, and I didn't care if the response was a redirect.

    Then I figured I'd use get_headers anyway if cURL was failing.

    Give it a go:

    /**
     * returns true/false if the $url is present.
     *
     * @param string $url assumes this is a valid url.
     *
     * @return bool
     */
    private function url_exists (string $url): bool
    {
      $ch = curl_init($url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);     // capture the response instead of printing the headers.
      curl_setopt($ch, CURLOPT_NOBODY, TRUE);             // this does a head request to make it faster.
      curl_setopt($ch, CURLOPT_HEADER, TRUE);             // just the headers
      curl_setopt($ch, CURLOPT_SSL_VERIFYSTATUS, FALSE);  // turn off that pesky ssl stuff - some sys admins can't get it right.
      curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
      // set a real user agent to stop linkedin getting upset.
      curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36');
      curl_exec($ch);
      $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
      // 200 and 400 stand in for the HTTP_OK / HTTP_BAD_REQUEST constants, which this snippet does not define.
      if (($http_code >= 200 && $http_code < 400) || $http_code === 999)
      {
        curl_close($ch);
        return TRUE;
      }
      $error = curl_error($ch); // used for debugging.
      curl_close($ch);
      // just try the get_headers - it might work!
      stream_context_set_default(array('http' => array('method' => 'HEAD')));
      $file_headers = @get_headers($url);
      if ($file_headers)
      {
        $response_code = substr($file_headers[0], 9, 3);
        return $response_code >= 200 && $response_code < 400;
      }
      return FALSE;
    }
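
    The acceptance rule used twice in the function above (any 2xx or 3xx code, plus LinkedIn's 999) could be pulled into a single helper so the cURL branch and the get_headers branch cannot drift apart. This is only a sketch; isAcceptableStatus() and its $extraAccepted parameter are assumed names.

    ```php
    <?php
    // Sketch only: the function name and the extra-codes parameter are assumed.

    /** Treat 2xx and 3xx as "exists", plus any site-specific codes such as 999. */
    function isAcceptableStatus(int $code, array $extraAccepted = [999]): bool
    {
        if ($code >= 200 && $code < 400) {
            return true;
        }
        return in_array($code, $extraAccepted, true);
    }
    ```

    The get_headers branch would need to cast its substr() result to int before calling the helper, since the comparison against $extraAccepted is strict.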
    
  • Here:

    $file = 'http://www.example.com/somefile.jpg';
    $file_headers = @get_headers($file);
    if(!$file_headers || $file_headers[0] == 'HTTP/1.1 404 Not Found') {
        $exists = false;
    }
    else {
        $exists = true;
    }
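
    By default get_headers() sends a GET request. As a hedged variation on the snippet above, a stream context can be passed as the third argument (PHP 7.1 or later) to send HEAD and set a timeout without changing the global default context. The names headContext() and urlExistsViaHead() and the five-second timeout are assumptions for illustration.

    ```php
    <?php
    // Sketch only: headContext() and urlExistsViaHead() are assumed names.

    /** Build a stream context that sends HEAD and gives up after $timeout seconds. */
    function headContext(float $timeout = 5.0)
    {
        return stream_context_create([
            'http' => [
                'method'  => 'HEAD',
                'timeout' => $timeout,
            ],
        ]);
    }

    /** True when a status line came back and it is not a 404. */
    function urlExistsViaHead(string $url): bool
    {
        // The third argument (context) requires PHP 7.1 or later.
        $headers = @get_headers($url, false, headContext());
        if ($headers === false || !isset($headers[0])) {
            return false;
        }
        return strpos($headers[0], ' 404') === false;
    }
    ```

    Some servers answer HEAD differently from GET, so results may not match the original snippet for every site.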
    

    From here and right below the above post, there's a curl solution:

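    function url_exists($url) {
        // curl_init() only creates a handle; the request has to be executed to learn the status.
        $ch = curl_init($url);
        curl_setopt_array($ch, array(CURLOPT_NOBODY => true, CURLOPT_RETURNTRANSFER => true));
        return curl_exec($ch) !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) < 400;
    }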
    
  • 2020-11-22 05:03

    I use this function:

    /**
     * @param $url
     * @param array $options
     * @return string
     * @throws Exception
     */
    function checkURL($url, array $options = array()) {
        if (empty($url)) {
            throw new Exception('URL is empty');
        }
    
        // list of HTTP status codes
        $httpStatusCodes = array(
            100 => 'Continue',
            101 => 'Switching Protocols',
            102 => 'Processing',
            200 => 'OK',
            201 => 'Created',
            202 => 'Accepted',
            203 => 'Non-Authoritative Information',
            204 => 'No Content',
            205 => 'Reset Content',
            206 => 'Partial Content',
            207 => 'Multi-Status',
            208 => 'Already Reported',
            226 => 'IM Used',
            300 => 'Multiple Choices',
            301 => 'Moved Permanently',
            302 => 'Found',
            303 => 'See Other',
            304 => 'Not Modified',
            305 => 'Use Proxy',
            306 => 'Switch Proxy',
            307 => 'Temporary Redirect',
            308 => 'Permanent Redirect',
            400 => 'Bad Request',
            401 => 'Unauthorized',
            402 => 'Payment Required',
            403 => 'Forbidden',
            404 => 'Not Found',
            405 => 'Method Not Allowed',
            406 => 'Not Acceptable',
            407 => 'Proxy Authentication Required',
            408 => 'Request Timeout',
            409 => 'Conflict',
            410 => 'Gone',
            411 => 'Length Required',
            412 => 'Precondition Failed',
            413 => 'Payload Too Large',
            414 => 'Request-URI Too Long',
            415 => 'Unsupported Media Type',
            416 => 'Requested Range Not Satisfiable',
            417 => 'Expectation Failed',
            418 => 'I\'m a teapot',
            422 => 'Unprocessable Entity',
            423 => 'Locked',
            424 => 'Failed Dependency',
            425 => 'Unordered Collection',
            426 => 'Upgrade Required',
            428 => 'Precondition Required',
            429 => 'Too Many Requests',
            431 => 'Request Header Fields Too Large',
            449 => 'Retry With',
            450 => 'Blocked by Windows Parental Controls',
            500 => 'Internal Server Error',
            501 => 'Not Implemented',
            502 => 'Bad Gateway',
            503 => 'Service Unavailable',
            504 => 'Gateway Timeout',
            505 => 'HTTP Version Not Supported',
            506 => 'Variant Also Negotiates',
            507 => 'Insufficient Storage',
            508 => 'Loop Detected',
            509 => 'Bandwidth Limit Exceeded',
            510 => 'Not Extended',
            511 => 'Network Authentication Required',
            599 => 'Network Connect Timeout Error'
        );
    
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    
        if (isset($options['timeout'])) {
            $timeout = (int) $options['timeout'];
            curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        }
    
        curl_exec($ch);
        $returnedStatusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
    
        if (array_key_exists($returnedStatusCode, $httpStatusCodes)) {
            return "URL: '{$url}' - Status code: {$returnedStatusCode} - Definition: {$httpStatusCodes[$returnedStatusCode]}";
        } else {
            return "'{$url}' does not exist";
        }
    }
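
    Since checkURL() returns a sentence, callers that need to branch on the outcome may find a coarser classification handy. The sketch below maps a code to its class by the first digit; statusClass() is an assumed helper name and is not part of the function above. cURL reports 0 when no HTTP response was received, which falls into the 'Unknown' bucket here.

    ```php
    <?php
    // Sketch only: statusClass() is an assumed helper name.

    /** Map a status code to its class, based on the first digit. */
    function statusClass(int $code): string
    {
        $classes = [
            1 => 'Informational',
            2 => 'Success',
            3 => 'Redirection',
            4 => 'Client Error',
            5 => 'Server Error',
        ];
        $firstDigit = intdiv($code, 100);

        return ($code >= 100 && $code <= 599) ? $classes[$firstDigit] : 'Unknown';
    }
    ```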
    