Scrapy doesn't seem to be doing DFO

臣服心动 2021-02-20 03:21

I have a website that my crawler needs to crawl in a fixed sequence. For example, it needs to visit a1, b1, c1 before it starts on a2, and so on. Each of a, b and c is handled by diff

3 Answers
  • 2021-02-20 03:42

    Depth-first searching is exactly what you are describing:

    search as deep into a's as possible before moving to b's
    

    To make Scrapy crawl breadth-first instead (a1, b1, c1, a2, etc.), change these settings:

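    # The module path is scrapy.squeues in current Scrapy;
    # old releases such as 0.14 used scrapy.squeue.
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'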
    

    *Found in the doc.scrapy.org FAQ

  • 2021-02-20 03:45

    Scrapy uses DFO by default. The crawl sequence looks out of order because Scrapy fetches pages asynchronously. Even though requests are scheduled in DFO order, responses can arrive in a different order because of network delay or other factors.
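
    To illustrate the effect, here is a minimal sketch using plain asyncio rather than Scrapy's own engine. The page names and latencies are made up for the example; the point is only that requests issued in one order can finish in another.

```python
import asyncio

# Hypothetical per-page latencies in seconds (illustrative values only).
DELAYS = {"a1": 0.15, "b1": 0.05, "c1": 0.10}


async def fetch(page, finished):
    # asyncio.sleep stands in for the network round trip.
    await asyncio.sleep(DELAYS[page])
    finished.append(page)


async def crawl():
    finished = []
    # Requests are issued in the order a1, b1, c1 and run concurrently,
    # so each one finishes after its own latency has elapsed.
    await asyncio.gather(*(fetch(page, finished) for page in DELAYS))
    return finished


def main():
    return asyncio.run(crawl())


if __name__ == "__main__":
    print(main())
```

    With these delays the pages finish in the order b1, c1, a1, even though a1 was requested first.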

  • 2021-02-20 03:57

    I believe that you are noticing the difference between depth-first and breadth-first searching algorithms (see Wikipedia for info on both.)

    Scrapy has the ability to change which algorithm is used:

    "By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:"

    See http://doc.scrapy.org/en/0.14/faq.html for more information.
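
    The link between queue type and crawl order in that quote can be shown with a toy model. This is only a sketch, not Scrapy's actual scheduler: the link graph is hypothetical, and priorities and concurrency are ignored.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "start": ["a1", "a2"],
    "a1": ["b1"],
    "a2": ["b2"],
    "b1": ["c1"],
    "b2": ["c2"],
    "c1": [],
    "c2": [],
}


def crawl_order(start, lifo):
    """Return the visit order when pending requests sit in a LIFO or FIFO queue."""
    pending = deque([start])
    visited = []
    while pending:
        # LIFO takes the newest pending request, FIFO the oldest.
        page = pending.pop() if lifo else pending.popleft()
        visited.append(page)
        pending.extend(LINKS[page])
    return visited


if __name__ == "__main__":
    print(crawl_order("start", lifo=True))   # depth-first
    print(crawl_order("start", lifo=False))  # breadth-first
```

    With the LIFO queue the model follows one chain to its end before starting the next (start, a2, b2, c2, a1, b1, c1); with the FIFO queue it visits every a before any b (start, a1, a2, b1, b2, c1, c2).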
