There's nothing that will work for all crawlers. There are two options that might be useful to you.
Robots that support wildcards should handle something like:
Disallow: /*/
The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.
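If you want a quick way to sanity-check what a wildcard rule would match, here's a minimal sketch of the kind of matching that wildcard-aware crawlers typically do (* matches any run of characters, $ anchors the end of the path). The wildcard_rule_matches helper is just an illustration, not any particular crawler's code:

import re

def wildcard_rule_matches(rule: str, path: str) -> bool:
    # Turn the robots.txt rule into a regex anchored at the start of
    # the path: '*' becomes '.*', a trailing '$' becomes an end anchor.
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, path) is not None

# Disallow: /*/ blocks anything inside a subdirectory...
print(wildcard_rule_matches("/*/", "/photos/holiday.jpg"))  # True (blocked)
# ...but leaves files in the root crawlable.
print(wildcard_rule_matches("/*/", "/index.html"))          # False (not blocked)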
If you have relatively few files in the root and you don't often add new files, you could use Allow to allow access to just those files, and then use Disallow: / to restrict everything else. That is:
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /
The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.
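If you want to check that ordering yourself, Python's standard-library urllib.robotparser also takes the first matching rule, so you can feed it the example above (the ExampleBot name and example.com URLs are just placeholders):

from urllib import robotparser

rules = """\
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The Allow lines match first, so the root files stay crawlable...
print(rp.can_fetch("ExampleBot", "https://example.com/index.html"))      # True
print(rp.can_fetch("ExampleBot", "https://example.com/coolstuff.jpg"))   # True

# ...while everything else falls through to Disallow: /
print(rp.can_fetch("ExampleBot", "https://example.com/private/page.html"))  # False

That only checks the logic, of course; it doesn't tell you what any given third-party crawler will actually do.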
If a crawler doesn't support Allow, then it's going to see the Disallow: / and not crawl anything on your site, provided, of course, that it ignores things in robots.txt that it doesn't understand.
All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.