Verifying Googlebot in .htaccess file

Submitted by 混江龙づ霸主 on 2019-12-02 00:58:21

You can use a condition on the %{HTTP_USER_AGENT} variable:

RewriteEngine on

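# Googlebot's User-Agent begins with "Mozilla/5.0", so match the name
# anywhere in the string, case-insensitively, rather than anchoring with ^
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]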
RewriteRule ^(.*)$ /do-something [L,R]

Keep in mind, though, that the User-Agent header is sent by the client, so %{HTTP_USER_AGENT} can be spoofed.

In .htaccess:

Order Allow,Deny

Allow from googlebot.com
Allow from search.msn.com
Allow from crawl.yahoo.net
Allow from baidu.com
Allow from yandex.ru
Allow from yandex.net
Allow from yandex.com

It may be a good idea to add other search engines as well.

From the Apache docs: http://httpd.apache.org/docs/2.2/mod/mod_authz_host.html#allow

...It will do a reverse DNS lookup on the IP address to find the associated hostname, and then do a forward lookup on the hostname to assure that it matches the original IP address. Only if the forward and reverse DNS are consistent and the hostname matches will access be allowed.
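The Order and Allow directives above are Apache 2.2 syntax; in Apache 2.4 they are only available through mod_access_compat. On 2.4, the same reverse-then-forward check is performed by Require host. A minimal sketch, assuming Apache 2.4 with mod_authz_host loaded, and reusing only the domains from the list above:

```apache
# Apache 2.4 sketch (assumes mod_authz_host is loaded).
# Require host does a reverse lookup on the client IP, then a forward
# lookup on the resulting hostname, and matches only if the two agree.
# Several Require lines outside a container are OR-ed together.
Require host googlebot.com
Require host search.msn.com
Require host crawl.yahoo.net
Require host baidu.com
Require host yandex.ru yandex.net yandex.com
```

As with the Allow list, any client that matches none of these is refused. In .htaccess this needs AllowOverride AuthConfig to be in effect.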

# Validate Googlebots
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$
RewriteCond %{HTTP:Accept} ^\*/\*$
RewriteCond %{HTTP:Accept-Encoding} ="gzip,deflate"
RewriteCond %{HTTP:Accept-Language} =""
RewriteCond %{HTTP:Accept-Charset} =""
RewriteCond %{HTTP:From} ="googlebot(at)googlebot.com"
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|7[0-9]|8[0-46-9]|9[0-5])\. [OR]
RewriteCond %{REMOTE_ADDR} ^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
# Optional reverse-DNS-lookup replacement for IP-address check lines above
# RewriteCond %{REMOTE_HOST} ^crawl(-([1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])){4}\.googlebot\.com$
RewriteRule ^ - [S=1]
# Block invalid Googlebots
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^ - [F]

Note that the optional reverse-DNS line will only work on servers which allow the use of reverse-DNS lookups.
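Whether reverse lookups are available is usually decided at server level. A sketch of what that setting might look like, assuming you can edit the main server or virtual host configuration (the HostnameLookups directive is not permitted in .htaccess):

```apache
# httpd.conf or inside a <VirtualHost> block; not valid in .htaccess.
# On     = reverse lookup only
# Double = reverse lookup followed by a confirming forward lookup
HostnameLookups Double
```

The default is Off, since every lookup adds DNS latency to the request.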

Further, once this rDNS lookup is triggered, the format of your access log file will change: it will no longer show IP addresses as the first entry on each line, but remote hostnames instead. This can significantly affect your server administration process and may cause some 'stats' programs to stop reporting server access summaries correctly. Once your server gets into this mode, it stays that way until it is restarted.

If you have server configuration privileges, you can change your log file format so that each line starts with the remote address instead of the remote hostname, regardless of whether rDNS is enabled: change the first token in the logging format from %h to %a. See the Apache mod_log_config documentation.
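The %h to %a change described above could look like this. It is a sketch that assumes the stock "combined" format; the nickname combined_ip and the log path are placeholders, not taken from the original answer:

```apache
# Server or virtual host configuration (not .htaccess).
# Same as the standard "combined" format, but with %a (client IP address)
# in place of %h (remote hostname) as the first field.
LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined_ip
CustomLog "logs/access_log" combined_ip
```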
