Can I block search crawlers for every site on an Apache web server?

走了就别回头了 2021-01-31 05:13

I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed.

6 Answers
  •  别那么骄傲
    2021-01-31 05:53

    You can use Apache's mod_rewrite to do it. Let's assume your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite requests for robots.txt so they go to it.
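    For reference, the staging robots file itself just needs to turn every crawler away; a minimal robots-staging.txt would look something like this:

      # robots-staging.txt -- served in place of robots.txt on the staging host
      User-agent: *
      Disallow: /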

    This rewrite is suitable for protecting a single staging site, a somewhat simpler case than what you are asking for, but it has worked reliably for me:

    
      RewriteEngine on
    
      # Dissuade web spiders from crawling the staging site
      RewriteCond %{HTTP_HOST}  ^staging\.example\.com$
      RewriteRule ^robots\.txt$ robots-staging.txt [L]
    
    
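    If you want this to cover every site on the server rather than matching host names one by one, one option is to serve a single disallow-all file from the main server config, outside any <VirtualHost>, so that every virtual host picks it up. The sketch below uses mod_alias rather than mod_rewrite, and the /srv/robots path is only an assumed location:

      # Main server config (e.g. httpd.conf), outside any <VirtualHost>:
      # every site on this server answers /robots.txt with the same file
      Alias /robots.txt /srv/robots/robots-staging.txt

      <Directory "/srv/robots">
          # Apache 2.4 access syntax; 2.2 would use Order/Allow from instead
          Require all granted
      </Directory>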

    You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.

    Here's how you would do that:

    
      RewriteEngine on
    
      # Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
      RewriteRule ^robots\.txt$ http://www.example.com/robots-staging.txt [R]
    
    
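    Either way, it is worth checking what a crawler would actually be handed. Something along these lines (using the staging host name from above) fetches /robots.txt the way a spider would:

      # Ask the server for robots.txt while pretending to visit the staging host
      curl -i -H "Host: staging.example.com" http://localhost/robots.txt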
