How to get final URL after following HTTP redirections in pure PHP?

有刺的猬 2020-11-29 03:41

What I'd like to do is find out the final URL after following all the redirections.

I would prefer not to use cURL. I would like t

5 answers
  • 2020-11-29 04:03

    xaav's answer is very good, except for the following two issues:

    • It does not support the HTTPS protocol => A solution was proposed as a comment on the original site: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
    • Some sites will not work because they do not recognise the underlying user agent (client browser) => This is fixed simply by adding a User-Agent header field. I added an Android user agent (you can find other user agent examples to suit your needs at http://www.useragentstring.com/pages/useragentstring.php):

      $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";

    Here's the modified answer:

    /**
     * get_redirect_url()
     * Gets the address that the provided URL redirects to,
     * or FALSE if there's no redirect. 
     *
     * @param string $url
     * @return string
     */
    function get_redirect_url($url){
        $redirect_url = null; 
    
        $url_parts = @parse_url($url);
        if (!$url_parts) return false;
        if (!isset($url_parts['host'])) return false; //can't process relative URLs
        if (!isset($url_parts['path'])) $url_parts['path'] = '/';
    
        $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
        if (!$sock) return false;
    
        $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
        $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
        $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";
        $request .= "Connection: Close\r\n\r\n"; 
        fwrite($sock, $request);
        $response = '';
        while(!feof($sock)) $response .= fread($sock, 8192);
        fclose($sock);
    
        if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
            if ( substr($matches[1], 0, 1) == "/" )
                return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
            else
                return trim($matches[1]);
    
        } else {
            return false;
        }
    
    }
    
    /**
     * get_all_redirects()
     * Follows and collects all redirects, in order, for the given URL. 
     *
     * @param string $url
     * @return array
     */
    function get_all_redirects($url){
        $redirects = array();
        while ($newurl = get_redirect_url($url)){
            if (in_array($newurl, $redirects)){
                break;
            }
            $redirects[] = $newurl;
            $url = $newurl;
        }
        return $redirects;
    }
    
    /**
     * get_final_url()
     * Gets the address that the URL ultimately leads to. 
     * Returns $url itself if it isn't a redirect.
     *
     * @param string $url
     * @return string
     */
    function get_final_url($url){
        $redirects = get_all_redirects($url);
        if (count($redirects)>0){
            return array_pop($redirects);
        } else {
            return $url;
        }
    }
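    The code above still opens a plain socket on port 80, so the HTTPS issue from the first bullet remains. One way to address it, sketched here under the assumption that PHP's OpenSSL extension is loaded, is to pick the fsockopen() target from the URL scheme. The helper name socket_target_for_url() is hypothetical and not part of the original answer.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): chooses the
 * fsockopen() target and port for a URL based on its scheme.
 * Returns null when the URL has no host or an unsupported scheme.
 */
function socket_target_for_url(string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // relative or malformed URL
    }

    $scheme = strtolower($parts['scheme'] ?? 'http');
    if ($scheme === 'https') {
        // The ssl:// transport is only available with the OpenSSL extension.
        return ['ssl://' . $parts['host'], $parts['port'] ?? 443];
    }
    if ($scheme === 'http') {
        return [$parts['host'], $parts['port'] ?? 80];
    }
    return null;
}

// Possible use inside get_redirect_url(), replacing the fixed port 80:
//   $target = socket_target_for_url($url);
//   if ($target === null) return false;
//   $sock = fsockopen($target[0], $target[1], $errno, $errstr, 30);
```

    The Host header should still carry the bare host name, without the ssl:// prefix.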
    
  • 2020-11-29 04:06
    /**
     * get_redirect_url()
     * Gets the address that the provided URL redirects to,
     * or FALSE if there's no redirect. 
     *
     * @param string $url
     * @return string
     */
    function get_redirect_url($url){
        $redirect_url = null; 
    
        $url_parts = @parse_url($url);
        if (!$url_parts) return false;
        if (!isset($url_parts['host'])) return false; //can't process relative URLs
        if (!isset($url_parts['path'])) $url_parts['path'] = '/';
    
        $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
        if (!$sock) return false;
    
        $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
        $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
        $request .= "Connection: Close\r\n\r\n"; 
        fwrite($sock, $request);
        $response = '';
        while(!feof($sock)) $response .= fread($sock, 8192);
        fclose($sock);
    
        if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
            if ( substr($matches[1], 0, 1) == "/" )
                return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
            else
                return trim($matches[1]);
    
        } else {
            return false;
        }
    
    }
    
    /**
     * get_all_redirects()
     * Follows and collects all redirects, in order, for the given URL. 
     *
     * @param string $url
     * @return array
     */
    function get_all_redirects($url){
        $redirects = array();
        while ($newurl = get_redirect_url($url)){
            if (in_array($newurl, $redirects)){
                break;
            }
            $redirects[] = $newurl;
            $url = $newurl;
        }
        return $redirects;
    }
    
    /**
     * get_final_url()
     * Gets the address that the URL ultimately leads to. 
     * Returns $url itself if it isn't a redirect.
     *
     * @param string $url
     * @return string
     */
    function get_final_url($url){
        $redirects = get_all_redirects($url);
        if (count($redirects)>0){
            return array_pop($redirects);
        } else {
            return $url;
        }
    }
    

    And, as always, give credit:

    http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
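    One detail in get_redirect_url() above: a Location value starting with "/" is rebuilt from the scheme and host only, so a non-default port is dropped, and other relative forms are returned unchanged. A slightly broader resolver might look like the sketch below. The function name resolve_location() is hypothetical, and it deliberately skips "../" normalisation and query-only references.

```php
<?php
/**
 * Hypothetical helper: turns a Location header value into an absolute URL,
 * using the URL that was just requested as the base.
 * Not a full RFC 3986 resolver: no dot-segment or query-only handling.
 */
function resolve_location(string $baseUrl, string $location): ?string
{
    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return null; // the base itself must be absolute
    }

    $location = trim($location);
    if ($location === '') {
        return null;
    }

    // Already absolute (scheme://...): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $location)) {
        return $location;
    }

    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative (//host/path): inherit the scheme only.
    if (strpos($location, '//') === 0) {
        return $base['scheme'] . ':' . $location;
    }

    // Root-relative (/path): keep scheme, host and port.
    if ($location[0] === '/') {
        return $origin . $location;
    }

    // Path-relative (page.html): replace the last segment of the base path.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}
```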

  • 2020-11-29 04:08

    While the OP wanted to avoid cURL, it's best to use it when it's available. Here's a solution which has the following advantages:

    • uses curl for all the heavy lifting, so it works with https
    • copes with servers that return a lower-cased location header name (neither xaav's nor webjay's answer handles this)
    • allows you to control how deep you want to go before giving up

    Here's the function:

    function findUltimateDestination($url, $maxRequests = 10)
    {
        $ch = curl_init();
    
        curl_setopt($ch, CURLOPT_HEADER, true);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRequests);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    
        //customize user agent if you desire...
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');
    
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_exec($ch);
    
        $url=curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    
        curl_close ($ch);
        return $url;
    }
    

    Here's a more verbose version which allows you to inspect the redirection chain rather than let curl follow it.

    function findUltimateDestination($url, $maxRequests = 10)
    {
        $ch = curl_init();
    
        curl_setopt($ch, CURLOPT_HEADER, true);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    
        //customize user agent if you desire...
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');
    
        while ($maxRequests--) {
    
            //fetch
            curl_setopt($ch, CURLOPT_URL, $url);
            $response = curl_exec($ch);
    
            //try to determine redirection url
            $location = '';
            if (in_array(curl_getinfo($ch, CURLINFO_HTTP_CODE), [301, 302, 303, 307, 308])) {
                if (preg_match('/^Location:(.*)$/mi', $response, $match)) {
                    $location = trim($match[1]);
                }
            }
    
            if (empty($location)) {
                //we've reached the end of the chain...
                return $url;
            }
    
            //build next url
            if ($location[0] == '/') {
                $u = parse_url($url);
                $url = $u['scheme'] . '://' . $u['host'];
                if (isset($u['port'])) {
                    $url .= ':' . $u['port'];
                }
                $url .= $location;
            } else {
                $url = $location;
            }
        }
    
        return null;
    }
    

    As an example of a redirection chain that this function handles but the others do not, try this:

    echo findUltimateDestination('http://dx.doi.org/10.1016/j.infsof.2016.05.005');
    

    At the time of writing, this involves 4 requests, with a mixture of Location and location headers involved.
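    For the socket-based answers, the mixed-case problem comes from matching the literal string "Location: " in the raw response. A case-insensitive match that is anchored to the start of a header line could be sketched as follows. The helper name extract_location_header() is hypothetical, and the sketch assumes the response uses the usual CRLF line endings.

```php
<?php
/**
 * Hypothetical helper: pulls the Location value out of a raw HTTP response,
 * whatever the casing of the header name. Returns null if there is none.
 */
function extract_location_header(string $response): ?string
{
    // Only inspect the header block, i.e. everything before the first blank line.
    $headers = explode("\r\n\r\n", $response, 2)[0];

    // ^ with the m flag anchors to a line start, so "Content-Location" is not matched;
    // the i flag accepts "location:", "LOCATION:" and so on.
    if (preg_match('/^Location:[ \t]*(.+?)[ \t\r]*$/mi', $headers, $m)) {
        $value = trim($m[1]);
        return $value === '' ? null : $value;
    }
    return null;
}
```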

  • 2020-11-29 04:12

    This adds two cases to the code from the answers by @xaav and @Houssem BDIOUI: a 404 error, and a URL that gives no response. In those cases get_final_url($url) returns the strings 'Error: 404 Not Found' and 'Error: No Responce' respectively.

    /**
     * get_redirect_url()
     * Gets the address that the provided URL redirects to,
     * or FALSE if there's no redirect,
     * or 'Error: No Responce',
     * or 'Error: 404 Not Found'
     *
     * @param string $url
     * @return string
     */
    function get_redirect_url($url)
    {
        $redirect_url = null;
    
        $url_parts = @parse_url($url);
        if (!$url_parts)
            return false;
        if (!isset($url_parts['host']))
            return false; //can't process relative URLs
        if (!isset($url_parts['path']))
            $url_parts['path'] = '/';
    
        $sock = @fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
        if (!$sock) return 'Error: No Responce';
    
        $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?' . $url_parts['query'] : '') . " HTTP/1.1\r\n";
        $request .= 'Host: ' . $url_parts['host'] . "\r\n";
        $request .= "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36\r\n";
        $request .= "Connection: Close\r\n\r\n";
        fwrite($sock, $request);
        $response = '';
        while (!feof($sock))
            $response .= fread($sock, 8192);
        fclose($sock);
    
        if (stripos($response, '404 Not Found') !== false)
        {
            return 'Error: 404 Not Found';
        }
    
        if (preg_match('/^Location: (.+?)$/m', $response, $matches))
        {
            if (substr($matches[1], 0, 1) == "/")
                return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
            else
                return trim($matches[1]);
    
        } else
        {
            return false;
        }
    
    }
    
    /**
     * get_all_redirects()
     * Follows and collects all redirects, in order, for the given URL.
     *
     * @param string $url
     * @return array
     */
    function get_all_redirects($url)
    {
        $redirects = array();
        while ($newurl = get_redirect_url($url))
        {
            if (in_array($newurl, $redirects))
            {
                break;
            }
            $redirects[] = $newurl;
            $url = $newurl;
        }
        return $redirects;
    }
    
    /**
     * get_final_url()
     * Gets the address that the URL ultimately leads to.
     * Returns $url itself if it isn't a redirect,
     * or 'Error: No Responce'
     * or 'Error: 404 Not Found',
     *
     * @param string $url
     * @return string
     */
    function get_final_url($url)
    {
        $redirects = get_all_redirects($url);
        if (count($redirects) > 0)
        {
            return array_pop($redirects);
        } else
        {
            return $url;
        }
    }
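    Note that stripos() searches the whole response, so the text "404 Not Found" appearing in any other header would also trigger the error. A narrower check, sketched here as an assumption rather than as part of the answer, reads the numeric code from the status line only. The helper name parse_status_code() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the numeric status code from the first line
 * of a raw HTTP response, or null if the line is not a valid status line.
 */
function parse_status_code(string $response): ?int
{
    // Matches e.g. "HTTP/1.1 404 Not Found" or "HTTP/2 301" at the very start.
    if (preg_match('#^HTTP/\d+(?:\.\d+)?[ \t]+(\d{3})\b#', $response, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Possible use in get_redirect_url():
//   if (parse_status_code($response) === 404) return 'Error: 404 Not Found';
```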
    
  • 2020-11-29 04:25
    function getRedirectUrl ($url) {
        stream_context_set_default(array(
            'http' => array(
                'method' => 'HEAD'
            )
        ));
        $headers = get_headers($url, 1);
        if ($headers !== false && isset($headers['Location'])) {
            return $headers['Location'];
        }
        return false;
    }
    

    Additionally...

    As was mentioned in a comment, the final item in $headers['Location'] will be your final URL after all redirects. It's important to note, though, that it won't always be an array: when the header appears only once, it is a plain string. In that case, indexing it like an array will return a single character, which is not what you want.

    If you are only interested in the final URL, after all the redirects, I would suggest changing

    return $headers['Location'];
    

    to

    return is_array($headers['Location']) ? array_pop($headers['Location']) : $headers['Location'];
    

    ... which is just shorthand for

    if(is_array($headers['Location'])){
         return array_pop($headers['Location']);
    }else{
         return $headers['Location'];
    }
    

    This fix will take care of either case (array, non-array), and remove the need to weed-out the final URL after calling the function.

    In the case where there are no redirects, the function will return false. Similarly, the function will also return false for invalid URLs (invalid for any reason). Therefore, it is important to check the URL for validity before running this function, or else incorporate the redirect check somewhere into your validation.
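    The array/non-array handling can also be pulled into a small function that works on the array returned by get_headers($url, 1). The sketch below additionally accepts any casing of the header name, which is an addition of mine rather than something the answer covers. The function name final_location_from_headers() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: given the associative array from get_headers($url, 1),
 * returns the last Location value, or null if there is none.
 */
function final_location_from_headers(array $headers): ?string
{
    $locations = [];
    foreach ($headers as $name => $value) {
        // Status lines sit under integer keys, so skip anything that is not a string key.
        if (is_string($name) && strcasecmp($name, 'Location') === 0) {
            // A header seen once is a string, seen several times an array.
            foreach ((array) $value as $v) {
                $locations[] = $v;
            }
        }
    }
    return $locations ? end($locations) : null;
}
```

    One caveat: values are grouped by header name, so if a chain mixes "Location" and "location", the last value collected is not guaranteed to be the last hop.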
