Can I use the “Host” directive in robots.txt?

Submitted by Deadly on 2019-12-22 04:37:09

Question


While searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.myhost.com

The Wikipedia article also states that Google understands the Host directive, but it gives little (in fact, no) further information.

On robotstxt.org, I found nothing about Host (or about Crawl-delay, which Wikipedia also mentions).

  1. Is it encouraged to use the Host directive at all?
  2. Does Google provide any official documentation on this robots.txt directive?
  3. How well do other crawlers support it?

Answer 1:


The original robots.txt specification says:

Unrecognised headers are ignored.

The spec calls them "headers", a term it never defines. But since the term appears in the section about the file format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".

So yes, you can use Host or any other field name.

  • Robots.txt parsers that support such fields, well, support them.
  • Robots.txt parsers that don’t support such fields must ignore them.

But keep in mind: since such fields are not part of the robots.txt specification, you can't be sure that different parsers interpret them the same way, so you'd have to check each supporting parser individually.
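
To make the "unrecognised headers are ignored" rule concrete, here is a minimal Python sketch of how a lenient robots.txt reader might keep the field names it knows (including Host) and silently skip everything else. This is an illustrative assumption, not the parser used by Yandex, Google, or any real crawler; the KNOWN_FIELDS set and the function name are hypothetical.

# Minimal sketch of a lenient robots.txt reader: keep known fields,
# silently ignore unrecognised ones (as the original spec allows).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "host", "crawl-delay"}

def parse_robots_txt(text):
    """Return (field, value) pairs, skipping comments, blank lines,
    and any field name not in KNOWN_FIELDS."""
    records = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        if field in KNOWN_FIELDS:                  # unknown fields are ignored
            records.append((field, value.strip()))
    return records

if __name__ == "__main__":
    sample = "User-Agent: *\nDisallow: /dir/\nHost: www.myhost.com\n"
    print(parse_robots_txt(sample))
    # [('user-agent', '*'), ('disallow', '/dir/'), ('host', 'www.myhost.com')]

A parser that does not know Host would simply leave it out of KNOWN_FIELDS and never see it; whether a given crawler actually honours the value is a separate question you would still have to verify per crawler.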



Source: https://stackoverflow.com/questions/22011604/can-i-use-the-host-directive-in-robots-txt
