I have set up a CrawlSpider that aggregates all outbound links (crawling from `start_urls` only to a certain depth, e.g. via `DEPTH_LIMIT = 2`).
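For context, `DEPTH_LIMIT` can be set globally in `settings.py` or per spider through `custom_settings`. A minimal sketch, where the spider name, URLs, and domains are placeholder assumptions:

```python
import scrapy

class OutboundLinksSpider(scrapy.Spider):
    name = 'outbound_links'  # hypothetical name
    # parallel lists: each start URL paired with the domain it belongs to
    start_urls = ['https://example.com/']   # placeholder
    start_domains = ['example.com']         # placeholder
    custom_settings = {'DEPTH_LIMIT': 2}    # stop crawling past depth 2
```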
I have now achieved it without rules. I attach a `meta` attribute to every start URL (paired with its domain in a parallel `start_domains` list) and then simply check myself whether the extracted links belong to the original domain, sending out new requests accordingly. To do this, override `start_requests`:
from scrapy import Request

def start_requests(self):
    # tag each request with its originating domain so callbacks can filter links
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]
In the subsequent parsing methods we grab the `meta` attribute with `domain = response.request.meta['domain']`, compare the domain against each extracted link, and send out new requests ourselves.
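Sketched out, such a callback might look like the following; the CSS selector, the `urlparse` comparison, and the item fields are my assumptions, not part of the original setup:

```python
from urllib.parse import urlparse

from scrapy import Request

def parse_item(self, response):
    domain = response.request.meta['domain']
    for href in response.css('a::attr(href)').getall():
        url = response.urljoin(href)  # resolve relative links
        if urlparse(url).netloc.endswith(domain):
            # same-domain link: keep crawling, propagating the domain
            yield Request(url, meta={'domain': domain}, callback=self.parse_item)
        else:
            # outbound link: record it (item shape is a placeholder)
            yield {'source_domain': domain, 'outbound_url': url}
```

Requests yielded this way still honor `DEPTH_LIMIT`, since Scrapy's DepthMiddleware tracks depth automatically for requests produced in response callbacks.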