Scrapy doesn't seem to be doing DFO

I have a website for which my crawler needs to follow a sequence. For example, it needs to go a1, b1, c1 before it starts going a2, etc. Each of a, b and c is handled by diff

3 Answers

广开言路

2021-02-20 03:42
Depth-first searching is exactly what you are describing:
```
search as deep into a's as possible before moving to b's
```
To make Scrapy crawl breadth-first instead (a1, b1, c1, a2, etc.), change these settings:
```
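# settings.py: switch the scheduler from LIFO (depth-first) to FIFO (breadth-first) queues
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
# Note: newer Scrapy releases name this module scrapy.squeues (with a trailing "s")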
```
(Found in the Scrapy FAQ at doc.scrapy.org.)
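
To see why the queue type decides the crawl order, here is a minimal sketch in plain Python. It is not Scrapy's scheduler: the `LINKS` graph and the `crawl_order` helper are hypothetical, made up to mirror the a/b/c naming above. Pending pages are taken either from the end of the queue (LIFO) or from the front (FIFO).

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value the pages it links to.
LINKS = {
    "start": ["a1", "b1", "c1"],
    "a1": ["a2"],
    "b1": ["b2"],
    "c1": ["c2"],
    "a2": [],
    "b2": [],
    "c2": [],
}


def crawl_order(start, lifo):
    """Return the order in which pages are visited.

    lifo=True takes the most recently queued page first (depth-first);
    lifo=False takes the oldest queued page first (breadth-first).
    """
    pending = deque([start])
    visited = []
    while pending:
        page = pending.pop() if lifo else pending.popleft()
        visited.append(page)
        links = LINKS[page]
        # For LIFO, queue links in reverse so the first link is taken first.
        pending.extend(reversed(links) if lifo else links)
    return visited


if __name__ == "__main__":
    print("LIFO:", crawl_order("start", lifo=True))
    print("FIFO:", crawl_order("start", lifo=False))
```

With this toy graph, the LIFO run goes as deep into the a's as possible before touching the b's (start, a1, a2, b1, b2, c1, c2), while the FIFO run visits a1, b1, c1 before a2.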
星月不相逢

2021-02-20 03:45

Scrapy uses DFO by default. The crawl sequence looks different because Scrapy crawls pages asynchronously: even though it schedules requests in DFO order, responses can come back in a seemingly unreasonable order because of network delay or other factors.
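
The effect can be illustrated with a small simulation. This is only a sketch: it uses `asyncio` as a stand-in (Scrapy itself is built on Twisted), and the `DELAYS` values and the `fetch` and `crawl` helpers are hypothetical, not measurements or Scrapy APIs.

```python
import asyncio

# Hypothetical per-page network delays in seconds (illustrative, not measured).
DELAYS = {"a1": 0.15, "b1": 0.05, "c1": 0.10}


async def fetch(page, finished):
    """Pretend to download a page, then record that it has completed."""
    await asyncio.sleep(DELAYS[page])
    finished.append(page)


async def crawl(pages):
    """Start all requests in the given order and let them run concurrently."""
    finished = []
    await asyncio.gather(*(fetch(page, finished) for page in pages))
    return finished


if __name__ == "__main__":
    # Requests are issued as a1, b1, c1, but completion follows the delays.
    print(asyncio.run(crawl(["a1", "b1", "c1"])))
```

Although the requests are issued as a1, b1, c1, they complete as b1, c1, a1, because completion order follows the delays rather than the order in which the requests were scheduled.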

梦谈多话

2021-02-20 03:57

I believe you are noticing the difference between the depth-first and breadth-first search algorithms (see Wikipedia for information on both).

Scrapy has the ability to change which algorithm is used:

"By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"

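See http://doc.scrapy.org/en/0.14/faq.html for the settings themselves (`DEPTH_PRIORITY`, `SCHEDULER_DISK_QUEUE` and `SCHEDULER_MEMORY_QUEUE`) and more information.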
