I need to write a web crawler for specific user agent

前端 未结 2 1339
佛祖请我去吃肉
佛祖请我去吃肉 2021-01-20 02:04

I need to write a web crawler, and want to be able to crawl using a known user agent. For example, I want my crawler to act as an iphone to crawl the mobile site of a websi

相关标签:
2条回答
  • 2021-01-20 02:25

    Please refer to RFC1945 for how a User Agent should be formed:

    10.15 User-Agent

    The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. Although it is not required, user agents should include this field with requests. The field can contain multiple product tokens (Section 3.7) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.

     User-Agent     = "User-Agent" ":" 1*( product | comment )
    

    Example:

      User-Agent: CERN-LineMode/2.15 libwww/2.17b3
    

    So what you put there is more or less up to you. You could pose to be a GoogleBot-Mobile:

    • https://www.google.com/support/webmasters/bin/answer.py?answer=1061943

    or pose as an iPhone and add your own stuff

    Mozilla/5.0 (iPhone; U; CPU iPhone OS) (compatible; MyBot/1.0; +http://about.my/bot")
    
    0 讨论(0)
  • 2021-01-20 02:27
        function crawl($url){
    
            $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"; // <-- this is user agent
            $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            $headers[]  = "Accept-Language:en-us,en;q=0.5";
            $headers[]  = "Accept-Encoding:gzip,deflate";
            $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
            $headers[]  = "Keep-Alive:115";
            $headers[]  = "Connection:keep-alive";
            $headers[]  = "Cache-Control:max-age=0";
    
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
            curl_setopt($curl, CURLOPT_ENCODING, "gzip");
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            $data = curl_exec($curl);
            curl_close($curl);
            return $data;
    
        }
    
    echo crawl("http://www.google.com"); // revenge
    

    You can always use eg http://m.facebook.com/ w/o user agent, although most of websites redirect user to proper content by reading user agent.

    0 讨论(0)
提交回复
热议问题