How do Scrapy rules work with crawl spider

一个人的身影 2021-01-31 19:38

I have a hard time understanding Scrapy CrawlSpider rules. I have an example that doesn't work as I would like it to, so it can be two things:

  1. I don't understand ho
3 answers
  •  生来不讨喜
    2021-01-31 20:04

    You are right: according to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first. Keep this in mind when you write your rules. For example, the following rules:

    rules = (
            Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
         )
    

    The second rule will never be applied, since all the links will be extracted by the first rule with the parse_item callback. The matches for the second rule will be filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny so that each link matches the correct rule:

    rules = (
            Rule(SgmlLinkExtractor(allow=(r'/items',)), deny=(r'/items/electronics',), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
         )
    
