How do Scrapy rules work with crawl spider

一个人的身影 2021-01-31 19:38

I have a hard time understanding Scrapy CrawlSpider rules. I have an example that doesn't work as I would like it to, so it can be two things:

  1. I don't understand ho
3 answers
  •  生来不讨喜
    2021-01-31 20:04

    You are right: according to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first. Keep this in mind when you write your rules. For example, the following rules:

    rules = (
            Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
         )
    

    The second rule will never be applied, since all the links will be extracted by the first rule with the parse_item callback. The matches for the second rule will be filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny so that each link matches the correct rule:

    rules = (
            Rule(SgmlLinkExtractor(allow=(r'/items',)), deny=(r'/items/electronics',), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
         )
    
