Crawl multiple domains with Scrapy without criss-cross

梦毁少年i 2021-01-07 03:28

I have set up a CrawlSpider that aggregates all outbound links, crawling from start_urls only to a certain depth (e.g. via DEPTH_LIMIT = 2). How can I make sure that the crawls for different start_urls stay separate, i.e. don't criss-cross into each other's domains?
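
For reference, a setup along those lines might look like the following sketch (the spider name, URLs, and the bare LinkExtractor are illustrative assumptions, not details from the question):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class OutboundLinkSpider(CrawlSpider):
        name = 'outbound'
        start_urls = ['https://example.com', 'https://example.org']
        custom_settings = {'DEPTH_LIMIT': 2}  # stop following links after two hops

        # Follow every link on every page and hand each response to parse_item
        rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

        def parse_item(self, response):
            yield {'url': response.url}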

2 Answers
  •  被撕碎了的回忆 · 2021-01-07 04:06

    I have now achieved it without rules. I attach a meta attribute to every start_url and then simply check for myself whether the extracted links belong to the original domain, sending out new requests accordingly.

    To do this, override start_requests:

    from scrapy import Request

    def start_requests(self):
        # Carry each start URL's domain along in the request meta
        return [Request(url, meta={'domain': domain}, callback=self.parse_item)
                for url, domain in zip(self.start_urls, self.start_domains)]

    In the subsequent parsing method we grab the meta attribute with domain = response.request.meta['domain'], compare the extracted links against that domain, and send out new requests ourselves.
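
    A minimal sketch of that parsing method, assuming a LinkExtractor for link extraction and a simple netloc suffix check for the domain comparison (both are my assumptions, not spelled out above):

    from urllib.parse import urlparse

    from scrapy import Request
    from scrapy.linkextractors import LinkExtractor

    def parse_item(self, response):
        domain = response.request.meta['domain']  # domain attached in start_requests
        yield {'url': response.url}  # record the visited page

        for link in LinkExtractor().extract_links(response):
            # Follow only links that stay on the original domain;
            # outbound links are recorded but not crawled further.
            if urlparse(link.url).netloc.endswith(domain):
                yield Request(link.url, meta={'domain': domain},
                              callback=self.parse_item)

    Note that the endswith check also matches subdomains such as shop.example.com; compare the netloc for equality instead if that is not wanted.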
