Question
I have links being indexed that shouldn't be. I need to remove them from Google. What should I enter in robots.txt? Example link: http://sitename.com/wp-content/uploads/2014/02/The-Complete-Program-2014.pdf
Answer 1:
With robots.txt, you can disallow crawling, not indexing.
With this robots.txt:
User-agent: *
Disallow: /wp-content/uploads/2014/02/The-Complete-Program-2014.pdf
any URL whose path starts with /wp-content/uploads/2014/02/The-Complete-Program-2014.pdf is not allowed to be crawled.
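To see the prefix matching in action, here is a minimal sketch using Python's standard-library urllib.robotparser. It only illustrates how one parser reads the rule above; real crawlers may differ in details. The helper name is_crawlable, the sibling file other.pdf, and the .bak variant are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the answer, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/uploads/2014/02/The-Complete-Program-2014.pdf
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: True if the rules above allow crawling this URL."""
    return parser.can_fetch(user_agent, url)


BASE = "http://sitename.com/wp-content/uploads/2014/02/"

# The exact file from the question is blocked.
print(is_crawlable(BASE + "The-Complete-Program-2014.pdf"))
# A path that merely starts with the disallowed path is blocked too.
print(is_crawlable(BASE + "The-Complete-Program-2014.pdf.bak"))
# A sibling file (made up for this example) is still crawlable.
print(is_crawlable(BASE + "other.pdf"))
```

Note that this only checks crawl permission; it says nothing about whether the URL ends up in an index.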
But if a bot finds this URL in some other way (e.g., linked by someone else), it might still index it (without ever crawling/visiting it). The same goes for search engines that have already indexed it: they might keep it (but will no longer visit it).
To disallow indexing, you could use the HTTP header X-Robots-Tag with the noindex parameter. In that case, you should not block crawling of the file in robots.txt; otherwise bots would never be able to see your headers (and so they would never know that you don’t want this file to get indexed).
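How you send the header depends on your web server. As a rough sketch only, here is what it could look like with Python's built-in http.server, assuming you want noindex on every PDF. The names wants_noindex, NoIndexPdfHandler, and serve, as well as the address and port, are invented for this example; a WordPress site would more likely set the header in its web server configuration.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlsplit


def wants_noindex(request_path: str) -> bool:
    """Hypothetical rule: mark every PDF as noindex, ignoring any query string."""
    return urlsplit(request_path).path.lower().endswith(".pdf")


class NoIndexPdfHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory, adding X-Robots-Tag to PDFs."""

    def end_headers(self):
        # Add the header just before the header block is finished.
        if wants_noindex(self.path):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the demo server (blocks until interrupted)."""
    ThreadingHTTPServer((host, port), NoIndexPdfHandler).serve_forever()
```

Calling serve() and requesting a PDF should show the X-Robots-Tag: noindex header in the response. Remember that bots only see it if robots.txt does not block the file.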
Source: https://stackoverflow.com/questions/32129121/disallow-pdf-files-from-indexing-robots-txt