I have set up a CrawlSpider aggregating all outbound links (crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
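For reference, DEPTH_LIMIT can be set project-wide in settings.py or per spider via custom_settings; a minimal sketch (the spider name and start URL below are placeholders):

from scrapy.spiders import CrawlSpider

class OutboundLinksSpider(CrawlSpider):      # placeholder name
    name = 'outbound_links'
    start_urls = ['https://example.com']     # placeholder start URL
    # Caps how many link hops Scrapy follows from start_urls;
    # the same setting could also live project-wide in settings.py.
    custom_settings = {'DEPTH_LIMIT': 2}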
I have now achieved it without rules. I attach a meta attribute to every start_url and then simply check myself whether the extracted links belong to the original domain, sending out new requests accordingly.
To do this, override start_requests:
from scrapy import Request

def start_requests(self):
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]
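self.start_domains is not defined in the snippet above; one way to build it (an assumption on my part, not part of the original code) is to derive it from start_urls:

from urllib.parse import urlparse

start_urls = ['https://example.com', 'https://example.org']   # placeholder start URLs
start_domains = [urlparse(u).netloc for u in start_urls]      # -> ['example.com', 'example.org']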
In the subsequent parsing methods we grab that meta attribute with domain = response.request.meta['domain'], compare the domain against the extracted links and send out new requests ourselves.
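For illustration, a minimal sketch of what such a parsing method could look like; the endswith-based domain check and the yielded dict are my assumptions, not part of the original spider:

from urllib.parse import urlparse
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse_item(self, response):
    domain = response.request.meta['domain']
    for link in LinkExtractor().extract_links(response):
        netloc = urlparse(link.url).netloc
        if netloc == domain or netloc.endswith('.' + domain):
            # Same domain: keep crawling and propagate the original domain.
            yield Request(link.url, meta={'domain': domain}, callback=self.parse_item)
        else:
            # Outbound link: collect it, e.g. as a plain item dict.
            yield {'source': response.url, 'outbound': link.url, 'domain': domain}

DEPTH_LIMIT still applies to the requests yielded here, since Scrapy's depth middleware counts depth on every request, not only on rule-generated ones.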
You would probably need to keep a data structure (e.g. a hash map, or simply a Python set) of URLs the crawler has already visited. Then it is just a matter of adding URLs to it as you visit them and skipping any URL that is already in it (since that means you have already visited it). There are probably more sophisticated ways of doing this that would give you better performance, but they would also be harder to implement.
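A minimal sketch of that idea inside the spider (the visited attribute and the maybe_follow helper are names I made up for illustration):

from urllib.parse import urldefrag
from scrapy import Request

# inside the spider class
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.visited = set()   # URLs already scheduled

def maybe_follow(self, url, domain):
    # Normalise away the fragment so page#a and page#b count as one URL.
    url, _ = urldefrag(url)
    if url in self.visited:
        return None        # already seen, skip it
    self.visited.add(url)
    return Request(url, meta={'domain': domain}, callback=self.parse_item)

Note that Scrapy's default duplicate filter (RFPDupeFilter) already drops repeated requests unless you pass dont_filter=True, so an explicit set like this is mainly useful if you need the visited list for your own bookkeeping.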