Question
Here is the content of my robots.txt file:
User-agent: *
Disallow: /images/
Disallow: /upload/
Disallow: /admin/
As you can see, I explicitly disallowed all robots from indexing the folders images, upload, and admin. The problem is that one of my clients sent a request to remove content from the images folder, because a .pdf document from the images folder appeared in the Google search results. Can anyone explain what I'm doing wrong here, and why Google indexed my folders?
Thx!
Answer 1:
Quoting the Google Webmaster Docs:
If I block Google from crawling a page using a robots.txt disallow directive, will it disappear from search results?
Blocking Google from crawling a page is likely to decrease that page's ranking or cause it to drop out altogether over time. It may also reduce the amount of detail provided to users in the text below the search result. This is because without the page's content, the search engine has much less information to work with.
--
However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
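In practice that means: once the noindex header described below is in place, the Disallow rules for those folders have to be lifted so Googlebot can recrawl the files and see the header. A sketch of the adjusted robots.txt (assuming only the admin folder still needs to stay blocked from crawling):

User-agent: *
Disallow: /admin/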
Set the X-Robots-Tag header with noindex for all files in those folders. Set the header from your web server config for the folders: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=de
Set the header from the Apache config for PDF files (requires mod_headers):

<Files ~ "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>
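To cover every file in a folder rather than just PDFs, the same header can be set per directory. A sketch, assuming the folder lives at /var/www/images:

<Directory "/var/www/images">
    Header set X-Robots-Tag "noindex, nofollow"
</Directory>

You can check that the header is actually being sent (the URL is hypothetical):

curl -I https://example.com/images/document.pdf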
Disable directory indexing/listing of this folder.
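A minimal sketch for this step, e.g. in the folder's .htaccess (assumes AllowOverride Options is permitted):

# Turn off the auto-generated directory listing
Options -Indexes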
Add an empty index.html with a "noindex" robots meta tag:

<meta name="robots" content="noindex, nofollow" />
<meta name="googlebot" content="noindex" />
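A minimal placeholder index.html carrying those tags might look like this (a sketch; nothing beyond the meta tags is required):

<!DOCTYPE html>
<html>
<head>
    <meta name="robots" content="noindex, nofollow" />
    <meta name="googlebot" content="noindex" />
    <title></title>
</head>
<body></body>
</html>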
Force the removal of the already-indexed pages manually through Google Webmaster Tools.
Question from the comments: how do I forbid access to all files in the folder?
# 1) Deny folder access completely
<Directory /var/www/denied_directory>
    Order allow,deny
    Deny from all
</Directory>
# 2) Inside the folder, place a .htaccess denying access to all, except to index.html
Order allow,deny
Deny from all
<FilesMatch "index\.html$">
    Allow from all
</FilesMatch>
# 3) Allow directory access, but deny a specific environment match
# (note: BrowserMatch is case-sensitive; Google's user agent is "Googlebot")
BrowserMatch "Googlebot" go_away_badbot
BrowserMatch "^BadRobot/0.9" go_away_badbot
<Directory /deny_access_for_badbot>
    Order allow,deny
    Allow from all
    Deny from env=go_away_badbot
</Directory>
# 4) Or redirect bots to the main page, sending HTTP status 301
BrowserMatch "Googlebot" badbot=1
RewriteEngine on
RewriteCond %{ENV:badbot} =1
RewriteRule ^/$ /main/ [R=301,L]
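Note that Order/Allow/Deny is Apache 2.2 syntax (it keeps working on 2.4 only via mod_access_compat). On Apache 2.4, option 1 would instead look like this (the directory path is the same assumption as above):

# 1) Apache 2.4 equivalent: deny folder access completely
<Directory /var/www/denied_directory>
    Require all denied
</Directory>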
Source: https://stackoverflow.com/questions/25764711/google-is-ignoring-my-robots-txt