How do I get the final, redirected, canonical URL of a website using PHP?

无人及你 2021-02-07 15:49

In the days of link shorteners and Ajax, there can be many links that ultimately point to the same content. I was wondering what the best way is to get the final, best link for

3 Answers
  • 2021-02-07 16:28

    I wrote you a little function to do it. It's simple, but it may be a starting point for you. Note: the http://dlvr.it/xxb0W URL returns an invalid URL in its Location response header.

    You'll need the Altumo PHP library for it to work. It's a library that I wrote, but it's MIT-licensed, as is this function.

    See: https://github.com/homer6/altumo

    Also, you'll have to wrap the call to the function in a try/catch, since it throws on error.

    /**
    * Gets the final URL of a URL that will be redirected.
    * 
    * @param string $url_string
    * @param int $max_redirects             //maximum number of redirects to follow
    * @throws \Exception                    //on error, or if there are too many redirects
    * @return string
    */
    function get_final_url( $url_string, $max_redirects = 10 ){
    
        for( $redirects = 0; $redirects <= $max_redirects; $redirects++ ){
    
            //validate URL
                $url = new \Altumo\String\Url( $url_string );
    
            //get the Location response header of the URL
                $client = new \Altumo\Http\OutgoingHttpRequest( $url_string );
                $response = $client->sendAndGetResponseMessage();
                $location = $response->getHeader( 'Location' );
    
            //return the URL if no Location header was found, else continue
                if( is_null($location) ){
                    return $url_string;
                }else{
                    $url_string = $location;
                }
    
        }
    
        //give up rather than loop forever on circular redirects
            throw new \Exception( 'Too many redirects.' );
    
    }
    
    echo get_final_url( 'your url here' );
    

    Please let me know if you'd like further modifications or help getting it going.
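
A redirect-following loop like the one above also has to cope with URLs that redirect to each other. Here is a minimal sketch of one way to guard against that, independent of the Altumo library. The names `follow_redirects`, `$fetch_location`, and `$max_hops` are made up for this example; `$fetch_location` is a hypothetical stand-in for whatever HTTP client you use, taking a URL and returning the `Location` header value or `null`.

```php
<?php

/**
 * Follows Location headers until a URL returns none.
 *
 * $fetch_location is a hypothetical callable (not part of any library):
 * it receives a URL and returns its Location header value, or null.
 */
function follow_redirects(string $url, callable $fetch_location, int $max_hops = 10): string
{
    $seen = [];

    for ($hop = 0; $hop <= $max_hops; $hop++) {
        // A URL we have already visited means the redirects are circular.
        if (isset($seen[$url])) {
            throw new RuntimeException("Redirect loop detected at $url");
        }
        $seen[$url] = true;

        $location = $fetch_location($url);
        if ($location === null) {
            return $url; // no further redirect: this is the final URL
        }
        $url = $location;
    }

    throw new RuntimeException("More than $max_hops redirects");
}

// Usage with a fake in-memory "network", so the logic can be tried offline.
$fake_redirects = ['http://short.example/a' => 'http://example.com/b', 'http://example.com/b' => null];
$fetch = function (string $url) use ($fake_redirects): ?string {
    return $fake_redirects[$url] ?? null;
};
echo follow_redirects('http://short.example/a', $fetch), "\n";
```

Passing the fetcher in as a callable keeps the loop logic separate from the HTTP client, which also makes it easy to test without network access.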

  • 2021-02-07 16:29

    Since I wasn't able to find any libraries that really did what I was looking for, and I was hoping to do more than just follow HTTP redirects, I have gone ahead and created a library that accomplishes the goals and released it under the MIT license. You can get it here:

    https://github.com/mattwright/URLResolver.php

    URLResolver.php is a PHP class that attempts to resolve URLs to a final, canonical link:

    • Follows 301 and 302 redirects found in HTTP headers
    • Follows Open Graph URL <meta> tags found in web page <head>
    • Follows Canonical URL <link> tags found in web page <head>
    • Aborts download quickly if content type is not an HTML page
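
The two tag-based steps in the list above could be sketched roughly as follows, using PHP's built-in DOM extension. This is not URLResolver.php's actual implementation; the function name `find_declared_url` is made up for this example, and preferring `og:url` over the canonical `<link>` is an assumption.

```php
<?php

/**
 * Returns the URL a page declares for itself in its <head>, or null.
 * Assumption: og:url is checked first, then rel="canonical".
 */
function find_declared_url(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    // Real-world HTML is rarely well-formed, so collect parser
    // warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // <meta property="og:url" content="...">, then <link rel="canonical" href="...">
    $queries = [
        '//head/meta[@property="og:url"]/@content',
        '//head/link[@rel="canonical"]/@href',
    ];

    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $value = trim($nodes->item(0)->nodeValue);
            if ($value !== '') {
                return $value;
            }
        }
    }

    return null;
}

// Usage: pass in the HTML of a page you have already downloaded.
$html = '<html><head><link rel="canonical" href="https://example.com/article"></head><body></body></html>';
var_dump(find_declared_url($html));
```

The returned value may be a relative URL, so a caller would still need to resolve it against the URL of the page it came from.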

    I am certainly not an expert on the rules of HTTP redirection, so if anyone has suggestions on how to improve this library, they would be greatly appreciated. I have tested it on thousands of URLs and it seems to do pretty well. I followed Mario's advice and used the PHP Simple HTML DOM Parser library where needed.

  • 2021-02-07 16:38

    Using Guzzle (a well-known and robust HTTP client), you can do it like this:

    <?php
    use Guzzle\Http\Client as GuzzleClient;
    use Guzzle\Plugin\History\HistoryPlugin;
    
    function resolveUrl($url)
    {
        $client   = new GuzzleClient($url);
        $history  = new HistoryPlugin();
        $client->addSubscriber($history);
    
        $response = $client->head($url)->send();
    
        if (!$response->isSuccessful()) {
            throw new \Exception(sprintf("Url %s is not a valid URL or website is down.", $url));
        }
    
        return $response->getEffectiveUrl();
    }
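
If you would rather not pull in a library, roughly the same thing can be sketched with PHP's cURL extension, which can follow redirects itself and report the URL it ended up at. The function name `resolve_url_with_curl` and the timeout and redirect limits are assumptions for this example. Like the snippet above, it only follows HTTP redirects, and some servers answer HEAD requests differently from GET requests.

```php
<?php

/**
 * Follows HTTP redirects with cURL and returns the last URL reached.
 * Requires the cURL extension; the limits below are arbitrary defaults.
 */
function resolve_url_with_curl(string $url, int $max_redirects = 10): string
{
    $handle = curl_init($url);
    if ($handle === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }

    curl_setopt_array($handle, [
        CURLOPT_NOBODY         => true,           // send a HEAD request, skip the body
        CURLOPT_FOLLOWLOCATION => true,           // follow Location headers
        CURLOPT_MAXREDIRS      => $max_redirects, // stop on long or circular chains
        CURLOPT_RETURNTRANSFER => true,           // do not print the response
        CURLOPT_TIMEOUT        => 10,             // seconds
    ]);

    if (curl_exec($handle) === false) {
        throw new RuntimeException("Could not resolve $url: " . curl_error($handle));
    }

    // The URL of the last request cURL made after following redirects.
    return curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
}

// Usage (needs network access):
// try {
//     echo resolve_url_with_curl('your url here');
// } catch (RuntimeException $e) {
//     echo $e->getMessage();
// }
```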
    