Question
I am working on a website that loads its data via AJAX. I also want the whole site to be crawlable by search engines like Google and Yahoo. I want to make two versions of the site... [1] When a regular user visits, the hyperlinks should work just like Gmail's (#'ed hyperlinks). [2] When a crawler visits, the hyperlinks should work normally (AJAX mode off).
How can I identify a crawler?
Answer 1:
You should not present a different form of your website to your users and a crawler. If Google discovers you doing that, they may reduce your search ranking because of it. Also, if you have a version that's only for a crawler, it may break without you noticing, thus giving search engines bad data.
What I'd recommend is building a version of your site that doesn't require AJAX, and having prominent links on each page to the non-AJAX version. This will also help users who may not like the AJAX version, or whose browsers aren't capable of handling it properly.
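For example, here is a minimal PHP sketch of how such a prominent link could switch versions. The "noajax" parameter and cookie name are hypothetical, just one way to implement the idea:

<?php
// If the visitor clicked the "basic HTML" link, remember the choice.
if (isset($_GET['noajax'])) {
    setcookie('noajax', '1', time() + 86400 * 30, '/'); // keep for 30 days
}
$noAjax = isset($_GET['noajax']) || isset($_COOKIE['noajax']);

if ($noAjax) {
    // Plain version: ordinary hyperlinks, no JavaScript required.
    echo '<a href="/inbox">Inbox</a>';
} else {
    // AJAX version, with a prominent link to the plain one.
    echo '<a href="#inbox" class="ajax-link">Inbox</a>';
    echo '<p><a href="?noajax=1">Prefer the basic HTML version?</a></p>';
}
?>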
Answer 2:
Crawlers can usually be identified by the User-Agent HTTP header. Look at this page for a list of user agents specific to crawlers. Some examples are:
Google:
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Googlebot/2.1 (+http://www.googlebot.com/bot.html)
- Googlebot/2.1 (+http://www.google.com/bot.html)
Also, here are some examples of reading the user-agent string in various languages:
PHP:
$_SERVER['HTTP_USER_AGENT']
Python Django:
request.META["HTTP_USER_AGENT"]
Ruby On Rails:
request.env["HTTP_USER_AGENT"]
...
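Building on those snippets, here is a rough PHP sketch of user-agent sniffing. The pattern list is illustrative and incomplete, since crawler strings change over time:

<?php
// Returns true if the user-agent string contains a well-known bot token.
function isCrawler(string $userAgent): bool
{
    return (bool) preg_match(
        '/googlebot|bingbot|slurp|yahoo|baiduspider|duckduckbot/i',
        $userAgent
    );
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (isCrawler($ua)) {
    // Serve the plain-HTML (non-AJAX) version.
} else {
    // Serve the AJAX version.
}
?>

Keep in mind that the User-Agent header is entirely client-controlled, so this only identifies well-behaved crawlers.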
Answer 3:
The HTTP headers of the crawler's request should contain a User-Agent field. You can check this field on your server.
Here is a list of TONS of User-Agents. Some examples:
Google robot 66.249.64.XXX ->
Googlebot/2.1 ( http://www.googlebot.com/bot.html)
Harvest-NG web crawler used by search.yahoo.com
Harvest-NG/1.0.2
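Since the User-Agent string can be spoofed, a stricter check is to verify the claimed crawler's source IP with a double DNS lookup (Google documents this procedure for verifying Googlebot). A minimal PHP sketch, with error handling omitted:

<?php
// Verify a claimed Googlebot: reverse-resolve the IP, check the domain,
// then forward-resolve the name and make sure it maps back to the same IP.
function isVerifiedGooglebot(string $ip): bool
{
    $host = gethostbyaddr($ip); // reverse lookup; returns the IP itself on failure
    if (!is_string($host) || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip; // forward lookup must match
}

// e.g. addresses in 66.249.64.XXX should reverse-resolve to *.googlebot.com
$verified = isVerifiedGooglebot($_SERVER['REMOTE_ADDR'] ?? '');
?>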
Answer 4:
This approach just makes life difficult for you. It requires you to maintain two completely separate versions of the site and to guess which version to serve to any given user. Search engines are not the only user agents that don't have JavaScript available and enabled.
Follow the principles of unobtrusive JavaScript and build on things that work. This avoids the need to decide which version to serve, since the JavaScript can fail gracefully while leaving a working HTML version underneath.
Source: https://stackoverflow.com/questions/3728467/identifying-a-search-engine-crawler