Common rule in robots.txt

Submitted by 。_饼干妹妹 on 2019-12-11 11:02:19

Question


How can I disallow URLs like 1.html, 2.html, ..., [0-9]+.html (in terms of regexp) with robots.txt?


Answer 1:


The original robots.txt specification doesn't support regular expressions or wildcards; a Disallow value matches any URL path that begins with it. Thanks to that prefix matching, you could still block URLs like these:

  • example.com/1.html
  • example.com/2367123.html
  • example.com/3
  • example.com/4/foo
  • example.com/5/1
  • example.com/6/
  • example.com/7.txt
  • example.com/883
  • example.com/9to5

with:

User-agent: *
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
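To sanity-check rules like these before deploying them, you could use Python's standard urllib.robotparser module; the following is a minimal sketch, assuming the rules above are served from example.com:

from urllib import robotparser

# The rules from the block above, as they would appear in robots.txt.
rules = """
User-agent: *
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Every numeric example URL is blocked; a non-numeric path is not.
for url in ("http://example.com/1.html",
            "http://example.com/9to5",
            "http://example.com/about"):
    print(url, rp.can_fetch("*", url))
# http://example.com/1.html False
# http://example.com/9to5 False
# http://example.com/about True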

If you want to block only URLs that start with a single digit followed by .html, append .html to each rule:

User-agent: *
Disallow: /0.html
Disallow: /1.html
…

However, this wouldn't block, for example, example.com/12.html, because /12.html does not begin with any of the ten blocked prefixes. The same parser can confirm that; a minimal sketch, again assuming example.com:
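from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /1.html
""".splitlines())

print(rp.can_fetch("*", "http://example.com/1.html"))   # False: blocked
print(rp.can_fetch("*", "http://example.com/12.html"))  # True: /12.html does not start with /1.html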



Source: https://stackoverflow.com/questions/13863586/common-rule-in-robots-txt
