Identifying a Search Engine Crawler

和自甴很熟 提交于 2020-01-16 04:48:07

问题


I am working on a website which loads its data via AJAX. I also want that the whole website can be crawled by search engines like google and yahoo. I want to make 2 versions of the site... [1] When a user comes the hyperlinks should work just like GMAIL (#'ed hyperlinks) [2] When a crawler comes the hyperlinks should work normally (AJAX mode off)

How can i identify a Crawler??


回答1:


You should not present a different form of your website to your users and a crawler. If Google discovers you doing that, they may reduce your search ranking because of it. Also, if you have a version that's only for a crawler, it may break without you noticing, thus giving search engines bad data.

What I'd recommend is building a version of your site that doesn't require AJAX, and having prominent links on each page to the non-AJAX version. This will also help users who may not like the AJAX version, or who have browser which aren't capable of handling it properly.




回答2:


Crawlers can usually be identified with the User-Agent HTTP Header. Look at this page for a list of user agents for crawlers specifically. Some examples are:

Google:

  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Googlebot/2.1 (+http://www.google.com/bot.html)

Also, here are some examples for getting the user agent string in various languages:

PHP:
$_SERVER['HTTP_USER_AGENT']

Python Django:
request.META["HTTP_USER_AGENT"]

Ruby On Rails:
request.env["HTTP_USER_AGENT"]

...



回答3:


The http headers of the crawler should contain a User-Agent field. You can check this field on your server.

Here is a list of TONS of User-Agents. Some examples:

Google robot 66.249.64.XXX ->
Googlebot/2.1 ( http://www.googlebot.com/bot.html)       

Harvest-NG web crawler used by search.yahoo.com 
Harvest-NG/1.0.2     



回答4:


This approach just makes life difficult for you. It requires you to maintain two completely separate versions of the site and try to guess what version to serve to any given user. Search engines are not the only user agents that don't have JavaScript available and enabled.

Follow the principles of unobtrusive JavaScript and build on things that work. This avoids the need to determine which version to give to a user since the JS can gracefully fail while leaving a working HTML version.



来源:https://stackoverflow.com/questions/3728467/identifying-a-search-engine-crawler

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!