Crawl multiple domains with Scrapy without criss-cross

梦毁少年i 2021-01-07 03:28

I have set up a CrawlSpider that aggregates all outbound links, crawling from start_urls only to a certain depth (e.g. via DEPTH_LIMIT = 2).



        
2 Answers
  • 2021-01-07 04:06

    I have now achieved it without rules. I attach a meta attribute to every start URL, then check myself whether each extracted link belongs to the original domain and send out new requests accordingly.

    Therefore, override start_requests:

    def start_requests(self):
        # Needs `from scrapy import Request`; start_domains is a list parallel to start_urls.
        return [Request(url, meta={'domain': domain}, callback=self.parse_item)
                for url, domain in zip(self.start_urls, self.start_domains)]
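
    The snippet above relies on a start_domains list that is not shown. One possible way to build it, assuming each start URL should be confined to its own host, is sketched below; domains_for is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urlparse


def domains_for(start_urls):
    """Return the host of each start URL, in the same order as start_urls."""
    # Assumption: "domain" here means the URL's hostname, e.g. "example.com".
    return [urlparse(url).hostname for url in start_urls]


start_urls = ["http://example.com/a", "https://docs.example.org/"]
start_domains = domains_for(start_urls)
```

    In a spider this could be a class attribute, e.g. start_domains = domains_for(start_urls), so that zip(self.start_urls, self.start_domains) pairs each URL with its host.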
    

    In subsequent parsing methods we grab the meta attribute with domain = response.request.meta['domain'], compare that domain with the extracted links, and send out new requests ourselves, passing the same meta along so the check keeps working at deeper levels.
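
    The domain comparison itself could look roughly like the sketch below. It uses only the standard library; belongs_to and links_to_follow are hypothetical helper names, and treating subdomains as part of the original domain is an assumption you may want to change.

```python
from urllib.parse import urljoin, urlparse


def belongs_to(url, domain):
    """Return True if url's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def links_to_follow(base_url, hrefs, domain):
    """Resolve hrefs against base_url and keep only those on `domain`."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    return [url for url in absolute if belongs_to(url, domain)]


# Inside the spider callback the helpers might be used like this
# (hedged sketch, not the original author's code):
#   domain = response.meta['domain']
#   hrefs = response.css('a::attr(href)').getall()
#   for url in links_to_follow(response.url, hrefs, domain):
#       yield Request(url, meta={'domain': domain}, callback=self.parse_item)
```

    Matching on the suffix "." + domain rather than a plain endswith(domain) avoids accepting hosts such as notexample.com for the domain example.com.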

  • 2021-01-07 04:23

    You would probably need to keep a data structure (e.g. a hash map or set) of URLs that the crawler has already visited. Then it's just a matter of adding each URL to it as you visit it, and skipping any URL that is already present, since that means you have already visited it. There are probably more sophisticated ways of doing this that would give you better performance, but they would also be harder to implement.
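
    A minimal sketch of such a visited-URL structure, using a Python set, is shown below. VisitedUrls and add_if_new are hypothetical names, and dropping the #fragment before comparing is an assumption about what counts as "the same URL". Note that Scrapy's default duplicate filter already skips requests it has seen (unless dont_filter=True is set), so something like this is mainly useful when you want your own notion of "visited".

```python
from urllib.parse import urldefrag


class VisitedUrls:
    """Remember URLs that were already scheduled for crawling."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record url and return True if unseen, otherwise return False."""
        # Ignore the fragment: page#a and page#b fetch the same document.
        key, _fragment = urldefrag(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


visited = VisitedUrls()
for url in ["http://example.com/a", "http://example.com/a#top", "http://example.com/b"]:
    if visited.add_if_new(url):
        print("crawl", url)
```

    Set membership checks take constant time on average, which is usually enough for crawls that fit in memory.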
