robots.txt to disallow all pages except one? Do they override and cascade?

Backend · Open · 4 answers · 1270 views

春和景丽 asked 2021-02-05 00:17

I want one page of my site to be crawled and no others.

Also, if it's any different from the answer above, I would also like to know the syntax for disallowing everything.

4 Answers
  • 2021-02-05 00:59

    The easiest way to allow access to just one page would be:

    User-agent: *
    Allow: /under-construction
    Disallow: /
    

    The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.

    The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.

    Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.

    If you just want to crawl http://example.com, but nothing else, you might try:

    Allow: /$
    Disallow: /
    

    The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
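    You can sanity-check the first (wildcard-free) ruleset with Python's standard-library `urllib.robotparser`, which implements the original first-match spec and, notably, does *not* understand `*` or `$` wildcards — which is exactly why the wildcard-free form above is the safer choice. A minimal sketch (the `example.com` URLs are placeholders):

    ```python
    import urllib.robotparser

    # The wildcard-free ruleset from above: allow one page, block the rest.
    robots = """\
    User-agent: *
    Allow: /under-construction
    Disallow: /
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots.splitlines())

    # First matching rule wins, so /under-construction is allowed...
    print(rp.can_fetch("*", "https://example.com/under-construction"))  # True
    # ...and everything else falls through to Disallow: /
    print(rp.can_fetch("*", "https://example.com/other-page"))          # False
    ```

    A parser like this treats a `$` in `Allow: /$` as a literal character, so the `$`-based ruleset only works with crawlers that support wildcards.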

  • 2021-02-05 01:02

    http://en.wikipedia.org/wiki/Robots.txt#Allow_directive

    The order is only important to robots that follow the original standard's first-match rule; for the Google and Bing bots, the order doesn't matter, because they apply the most specific matching rule instead.

  • 2021-02-05 01:10

    If you log into Google Webmaster Tools and, in the left panel, go to Crawl and then Fetch as Google, you can test how Google will crawl each page.

    In the case of blocking everything but the homepage:

    User-agent: *
    Allow: /$
    Disallow: /
    

    will work.

  • 2021-02-05 01:10

    You can use either of the configurations below; both will work:

    User-agent: *
    Allow: /$
    Disallow: /
    

    or

    User-agent: *
    Allow: /index.php
    Disallow: /
    

    The Allow must come before the Disallow, because the file is read from top to bottom.

    Disallow: / says "disallow anything that starts with a slash," which means everything on the site.

    The $ means "end of string," as in regular expressions, so Allow: /$ matches only your homepage (/).
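    For crawlers like Googlebot, the rule that actually decides is "longest (most specific) matching pattern wins, Allow wins ties," not file order. A rough Python sketch of that evaluation (the helper names `pattern_to_regex` and `is_allowed` are made up for illustration, not any library's API):

    ```python
    import re

    def pattern_to_regex(pattern):
        # Translate a robots.txt path pattern to a regex:
        # '*' matches any run of characters, '$' anchors the end of the path.
        regex = ""
        for ch in pattern:
            if ch == "*":
                regex += ".*"
            elif ch == "$":
                regex += "$"
            else:
                regex += re.escape(ch)
        return re.compile(regex)

    def is_allowed(path, rules):
        # rules: list of (directive, pattern) pairs.
        # Google-style evaluation: the longest matching pattern wins;
        # on a tie, Allow beats Disallow. No matching rule means allowed.
        best = None  # (pattern length, allowed?)
        for directive, pattern in rules:
            if pattern_to_regex(pattern).match(path):
                candidate = (len(pattern), directive == "Allow")
                if best is None or candidate > best:
                    best = candidate
        return True if best is None else best[1]

    rules = [("Allow", "/$"), ("Disallow", "/")]
    print(is_allowed("/", rules))            # True  (only the homepage matches /$)
    print(is_allowed("/about.html", rules))  # False (only Disallow: / matches)
    ```

    Under this evaluation the order of the two lines is irrelevant, which is why the `Allow`/`Disallow` ordering only matters for crawlers that follow the original first-match spec.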
