Question
Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:
User-Agent: *
Disallow: /dir/
Host: www.myhost.com
Also, the Wikipedia article states that Google too understands the Host directive, but there wasn’t much information about it (in fact, none).
At robotstxt.org, I didn’t find anything on Host (or Crawl-delay, as stated on Wikipedia).
- Is it encouraged to use the Host directive at all?
- Are there any resources from Google on this specific robots.txt directive?
- How is compatibility with other crawlers?
Answer 1:
The original robots.txt specification says:
Unrecognised headers are ignored.
They call it "headers" but this term is not defined anywhere. But as it’s mentioned in the section about the format, and in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".
So yes, you can use Host or any other field name.
- Robots.txt parsers that support such fields, well, support them.
- Robots.txt parsers that don’t support such fields must ignore them.
But keep in mind: as these fields are not part of the original robots.txt specification, you can’t be sure that different parsers support them in the same way, so you’d have to check every supporting parser manually.
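To make the "unrecognised fields are ignored" behaviour concrete, here is a minimal sketch in Python. The extract_host() helper is hypothetical and just illustrates one way to read such a field yourself, since (as far as I know) the standard library’s urllib.robotparser simply drops lines it doesn’t recognise, including Host:

import urllib.request
from typing import Optional

def extract_host(robots_txt: str) -> Optional[str]:
    """Return the value of the first Host: field, or None if there isn't one."""
    for raw_line in robots_txt.splitlines():
        # robots.txt allows trailing comments, so strip them along with whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        if field.strip().lower() == "host":
            return value.strip()
    return None

if __name__ == "__main__":
    # www.myhost.com is just the example domain from the question.
    with urllib.request.urlopen("https://www.myhost.com/robots.txt") as resp:
        robots_txt = resp.read().decode("utf-8", errors="replace")
    print(extract_host(robots_txt))  # e.g. "www.myhost.com", or None

A parser that supports Host would do something equivalent internally; one that doesn’t should simply skip the line, which is exactly what the specification’s "unrecognised headers are ignored" rule allows.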
Source: https://stackoverflow.com/questions/22011604/can-i-use-the-host-directive-in-robots-txt