I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites didn't get indexed.
Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.
Only downside to this is you now have to type in a username/password the first time you browse to any pages on the staging server.
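A minimal sketch of what that could look like, assuming your staging sites live under /var/www/staging and you keep the password file at /etc/apache2/.htpasswd (both paths are just examples, adjust to your layout):
# Create the password file first, e.g.: htpasswd -c /etc/apache2/.htpasswd someuser
<Directory "/var/www/staging">
AuthType Basic
AuthName "Staging server"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
</Directory>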
Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?)
If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.
Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt.
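For example, something along these lines in the main server config (a sketch; the path to the restrictive file is just an example):
# Placed outside any <VirtualHost> block, so it is inherited by every vhost
# that doesn't define its own /robots.txt alias; depending on your setup you
# may also need a <Directory> block granting access to that path
Alias /robots.txt /var/www/robots-staging.txt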
You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite the request to go to that.
This example is suited to protecting a single staging site, a somewhat simpler use case than what you are asking for, but it has worked reliably for me:
<IfModule mod_rewrite.c>
RewriteEngine on
# Dissuade web spiders from crawling the staging site
RewriteCond %{HTTP_HOST} ^staging\.example\.com$
# Leading slash made optional so the rule works in both server and .htaccess
# context; assumes robots-staging.txt sits in the document root
RewriteRule ^/?robots\.txt$ /robots-staging.txt [L]
</IfModule>
You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.
Here's how you would do that:
<IfModule mod_rewrite.c>
RewriteEngine on
# Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
RewriteRule ^/?robots\.txt$ http://www.example.com/robots-staging.txt [R,L]
</IfModule>
Create a robots.txt file with the following contents:
User-agent: *
Disallow: /
Put that file somewhere on your staging server; the document root is a convenient place for it (e.g. /var/www/html/robots.txt).
Add the following to your httpd.conf file:
# Exclude all robots
<Location "/robots.txt">
SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt
The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example.
That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.
(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)
Try the "Using Apache to stop bad robots" approach: filter requests based on the User-Agent header. You can find lists of bot user agents online, or flip the logic and only allow known browsers rather than trying to block every bot.
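A rough sketch of the user-agent approach (Apache 2.4 syntax, assuming mod_setenvif is enabled; the pattern below is purely illustrative, not a curated bot list):
# Mark requests whose User-Agent looks like a generic bot/crawler
SetEnvIfNoCase User-Agent "bot|crawler|spider" suspected_bot
<Location "/">
<RequireAll>
Require all granted
Require not env suspected_bot
</RequireAll>
</Location>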