I have set up a CrawlSpider aggregating all outbound links (crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
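For reference, DEPTH_LIMIT can be set project-wide in settings.py or per spider via custom_settings; a minimal sketch (the spider name and start URL below are placeholders):

from scrapy.spiders import CrawlSpider

class OutboundLinksSpider(CrawlSpider):      # placeholder name
    name = 'outbound_links'
    start_urls = ['https://example.com']     # placeholder start URL
    # Caps how many link hops Scrapy follows from start_urls;
    # the same setting could also live project-wide in settings.py.
    custom_settings = {'DEPTH_LIMIT': 2}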
I have now achieved it without rules. I attach a meta attribute to every start_url and then simply check myself whether the extracted links belong to the original domain, sending out new requests accordingly.
To do this, override start_requests:
from scrapy import Request

def start_requests(self):
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]
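self.start_domains is not defined in the snippet above; one way to build it (an assumption on my part, not part of the original code) is to derive it from start_urls:

from urllib.parse import urlparse

start_urls = ['https://example.com', 'https://example.org']   # placeholder start URLs
start_domains = [urlparse(u).netloc for u in start_urls]      # -> ['example.com', 'example.org']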
In the subsequent parsing methods we grab that meta attribute with domain = response.request.meta['domain'], compare the domain against the extracted links and send out new requests ourselves.
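For illustration, a minimal sketch of what such a parsing method could look like; the endswith-based domain check and the yielded dict are my assumptions, not part of the original spider:

from urllib.parse import urlparse
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse_item(self, response):
    domain = response.request.meta['domain']
    for link in LinkExtractor().extract_links(response):
        netloc = urlparse(link.url).netloc
        if netloc == domain or netloc.endswith('.' + domain):
            # Same domain: keep crawling and propagate the original domain.
            yield Request(link.url, meta={'domain': domain}, callback=self.parse_item)
        else:
            # Outbound link: collect it, e.g. as a plain item dict.
            yield {'source': response.url, 'outbound': link.url, 'domain': domain}

DEPTH_LIMIT still applies to the requests yielded here, since Scrapy's depth middleware counts depth on every request, not only on rule-generated ones.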
You would probably need to keep a data structure (e.g. a hash map, or simply a Python set) of URLs the crawler has already visited. Then it is just a matter of adding URLs to it as you visit them and skipping any URL that is already in it (since that means you have already visited it). There are probably more sophisticated ways of doing this that would give you better performance, but they would also be harder to implement.
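A minimal sketch of that idea inside the spider (the visited attribute and the maybe_follow helper are names I made up for illustration):

from urllib.parse import urldefrag
from scrapy import Request

# inside the spider class
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.visited = set()   # URLs already scheduled

def maybe_follow(self, url, domain):
    # Normalise away the fragment so page#a and page#b count as one URL.
    url, _ = urldefrag(url)
    if url in self.visited:
        return None        # already seen, skip it
    self.visited.add(url)
    return Request(url, meta={'domain': domain}, callback=self.parse_item)

Note that Scrapy's default duplicate filter (RFPDupeFilter) already drops repeated requests unless you pass dont_filter=True, so an explicit set like this is mainly useful if you need the visited list for your own bookkeeping.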