How to identify the web crawlers of Google/Yahoo/MSN in PHP?

清酒与你 2020-12-29 17:51

AFAIK,

$_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

8 Answers
  • 2020-12-29 17:56
    Look at $_SERVER['HTTP_USER_AGENT']. For example:
    
    • Google Bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    • MSN Bot = "msnbot-products/1.0 (+http://search.msn.com/msnbot.htm)"

    Check various user agent strings here: http://www.user-agents.org/
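
As a rough sketch of that check (not an exhaustive or authoritative list), the user agent can be matched against a few known substrings. The function name guessCrawlerFromUserAgent and the signature table are made up for illustration; the "slurp" entry assumes Yahoo's crawler identifies itself with "Slurp", as reported elsewhere in this thread. The header is supplied by the client, so a match is only a hint.

```php
<?php
// Hypothetical helper: guess the search engine from a User-Agent string.
// Returns 'google', 'msn', 'yahoo', or null when nothing matches.
function guessCrawlerFromUserAgent(string $userAgent): ?string
{
    // Lowercase substrings to look for (assumed, illustrative list).
    $signatures = [
        'google' => 'googlebot',
        'msn'    => 'msnbot',
        'yahoo'  => 'slurp',
    ];

    foreach ($signatures as $engine => $needle) {
        // stripos() is case-insensitive and returns false when not found.
        if (stripos($userAgent, $needle) !== false) {
            return $engine;
        }
    }

    return null;
}

// The header may be missing, so fall back to an empty string.
$engine = guessCrawlerFromUserAgent($_SERVER['HTTP_USER_AGENT'] ?? '');
```

Anything that matters (rate limits, access decisions) should not rely on this result alone, since any client can send the same string.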

  • 2020-12-29 17:58

    I don't think crawlers come from google.com, and there are other visitors coming from there that you don't want to treat as bots: everyone who searches for your site.

    What you need to do is look at the IP addresses of the different bots. http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553

  • 2020-12-29 18:01

    You identify search engines by user agent and IP address. More info can be found in How to identify search engine spiders and webbots. It's also worth noting this list. You shouldn't treat user agents (or even remote hosts) as necessarily definitive, however. A user agent is nothing more than what the other end tells you it is, and it is of course free to tell you anything. It's trivial to write code that pretends to be Googlebot.

    In PHP, this means looking at $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_HOST'].
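
One caveat worth sketching: $_SERVER['REMOTE_HOST'] is only populated when the web server is configured to do reverse lookups (for Apache, HostnameLookups On), so it is often empty. A minimal fallback, assuming one reverse lookup per request is acceptable, is to call gethostbyaddr() on REMOTE_ADDR. The helper name remoteHostName and the injectable $reverse parameter are hypothetical, added so the lookup can be stubbed.

```php
<?php
// Hypothetical helper: return the client's hostname in lowercase, or null.
// $server is expected to look like $_SERVER; $reverse defaults to gethostbyaddr().
function remoteHostName(array $server, ?callable $reverse = null): ?string
{
    // Use the value the web server already resolved, if there is one.
    if (!empty($server['REMOTE_HOST'])) {
        return strtolower($server['REMOTE_HOST']);
    }

    if (empty($server['REMOTE_ADDR'])) {
        return null;
    }

    $reverse = $reverse ?? 'gethostbyaddr';
    $host = $reverse($server['REMOTE_ADDR']);

    // gethostbyaddr() returns the unmodified IP on failure
    // and false on malformed input.
    if ($host === false || $host === $server['REMOTE_ADDR']) {
        return null;
    }

    return strtolower($host);
}

$host = remoteHostName($_SERVER);
```

A reverse lookup can be slow, so caching the result per IP address is probably worthwhile on a busy site.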

    There are a lot of search engines, but honestly it's only the big few you really care about, generally speaking. Google and Yahoo together have almost all of the market. But of course it depends on what you're trying to achieve.

    Note: be very careful about treating search engines differently from normal users (like the "evil hyphen site", as Joel put it) when it comes to content. In particularly egregious cases, this could get your site removed from that search engine. Even if that doesn't happen, you will probably put off some users who go to a site expecting something. If they're then presented with a "Please register to see this article" box instead, well, gratz on your high bounce rate.

  • 2020-12-29 18:02

    First of all, I hope you're not trying to do this in order to serve search engine bots different content from what your site shows normal users. If they discover you doing this, your site can be removed from their listings entirely. As long as you understand the risks, you can usually find information about the unique user agent each crawler uses:

    • Verifying Googlebot (use user-agent, reverse DNS if you want to be sure)
    • Yahoo's user agent will contain "Slurp"

    However, some people writing (usually poorly-behaved) web scrapers will set their User Agent strings to be the same as "legitimate" crawlers such as Google's. You can catch these by doing lookups on the bot's IP address/hostname to ensure that they actually are coming from Google/Yahoo/etc. Some more info about what to look for in hostname lookups (from this article):

    • Google crawler hostnames end with googlebot.com, as in crawl-66-249-70-244.googlebot.com.
    • Yahoo crawler hostnames end with crawl.yahoo.net, as in llf520064.crawl.yahoo.net.
    • Live Search crawler hostnames end with search.msn.com, as in msnbot-65-55-104-161.search.msn.com.
    • Ask crawler hostnames end with ask.com, as in crawler4037.ask.com.
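
The lookup described above can be sketched as a forward-confirmed reverse DNS check: resolve the IP to a hostname, compare the hostname against the suffixes listed, then resolve the hostname back and confirm it maps to the same IP. This is a minimal sketch, not a vetted implementation. The names verifyCrawlerIp and hostEndsWith and the injectable resolver parameters are hypothetical; the suffixes are copied from the list above and may be out of date; gethostbynamel() only returns IPv4 addresses, so IPv6 clients are not handled here.

```php
<?php
// Hostname suffixes copied from the list above (may be out of date).
// The leading dot keeps "evilgooglebot.com" from matching.
const CRAWLER_HOST_SUFFIXES = [
    'google' => '.googlebot.com',
    'yahoo'  => '.crawl.yahoo.net',
    'msn'    => '.search.msn.com',
    'ask'    => '.ask.com',
];

// Case-insensitive "ends with" check for hostnames.
function hostEndsWith(string $host, string $suffix): bool
{
    $host = strtolower(rtrim($host, '.'));
    return substr($host, -strlen($suffix)) === $suffix;
}

// Hypothetical helper: return the engine name when $ip passes both the
// reverse and the forward lookup, otherwise null.
// $reverse and $forward default to gethostbyaddr() and gethostbynamel().
function verifyCrawlerIp(string $ip, ?callable $reverse = null, ?callable $forward = null): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null;
    }

    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: IP -> hostname. Failure yields false or the IP itself.
    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return null;
    }

    foreach (CRAWLER_HOST_SUFFIXES as $engine => $suffix) {
        if (hostEndsWith($host, $suffix)) {
            // Step 2: hostname -> IPs, which must include the original IP.
            $ips = $forward($host);
            return (is_array($ips) && in_array($ip, $ips, true)) ? $engine : null;
        }
    }

    return null;
}

// Typical call site (performs real DNS lookups):
// $engine = verifyCrawlerIp($_SERVER['REMOTE_ADDR'] ?? '');
```

The forward step matters because whoever controls an IP range also controls its reverse DNS records and can make them claim any hostname.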
  • 2020-12-29 18:11

    You are probably better off using $_SERVER['HTTP_USER_AGENT'] and looking for Googlebot or Yahoo! Slurp.

  • 2020-12-29 18:11

    The best way to do it with well-known, well-behaved robots like the ones you mentioned is by user agent, which you can find in $_SERVER['HTTP_USER_AGENT'].
