Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
Why?
Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary.
But — if you insist on doing it anyway — that's what the User-Agent:
line in robots.txt is for.
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /
With lines for all the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.
As everyone know, the robots.txt is a standard to be obeyed by the crawler and hence only well-behaved agents do so. So, putting it or not doesn't matter.
If you have some data, that you do not show on the site as well, you can just change the permission and improve the security.
User-agent: * Disallow: / User-agent: Googlebot Allow: / User-agent: Slurp Allow: / User-Agent: msnbot Disallow:
Slurp is Yahoo's robot
There are more than 3 major search engines depending on which country you are talking. Facebook seem to be doing a good job listing only legitimate ones: https://facebook.com/robots.txt
So your robots.txt can be something like:
User-agent: Applebot
Allow: /
User-agent: baiduspider
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Facebot
Allow: /
User-agent: Googlebot
Allow: /
User-agent: msnbot
Allow: /
User-agent: Naverbot
Allow: /
User-agent: seznambot
Allow: /
User-agent: Slurp
Allow: /
User-agent: teoma
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: Yandex
Allow: /
User-agent: Yeti
Allow: /
User-agent: *
Disallow: /