I'm having a hard time understanding Scrapy CrawlSpider rules. I have an example that doesn't work the way I would like it to, so it could be one of two things:
You are right: according to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first one. Keep this in mind when you write your rules. For example, with the following rules:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)
the second rule will never be applied, since all the links are already extracted by the first rule and handled by the parse_item callback. The matches for the second rule are then filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny so that each rule matches only the links intended for it:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',), deny=(r'/items/electronics',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)