Dynamic rules based on start_urls for Scrapy CrawlSpider?

深忆病人 · asked 2021-01-07 17:17

I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links whose domain differs from the original site's domain).

2 Answers
  •  执念已碎 · answered 2021-01-07 18:02

    I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it isn't supported out of the box in Scrapy.

    I've created a function that takes a URL as input and creates the rules for it:

    def rules_for_url(self, url):

        # Tools.get_domain is a small helper that extracts the domain from a URL
        # (a sketch of it is given below).
        domain = Tools.get_domain(url)

        rules = (
            # Follow links that stay on this domain and parse them as internal pages.
            Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
            # Parse links that leave this domain as external pages, without following further.
            Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
        )

        return rules
    

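    Tools.get_domain isn't a Scrapy utility but a small helper of my own; a minimal sketch of such a helper using urllib.parse might look like this (the exact normalization is an assumption, adapt it to your needs):

    from urllib.parse import urlparse

    class Tools:

        @staticmethod
        def get_domain(url):
            # Return the host part of the URL ("example.com"), stripping a
            # leading "www." so internal links compare equal to the start URL.
            netloc = urlparse(url).netloc.lower()
            return netloc[4:] if netloc.startswith('www.') else netloc
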
    I then override some of CrawlSpider's functions.

    1. I changed _rules into a dictionary whose keys are the different website domains and whose values are the rules for that domain (built with rules_for_url). _rules is populated in _compile_rules.

    2. I then make the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.

    # Note: these methods go in the spider's class body and rely on copy, six,
    # HtmlResponse (from scrapy.http) and the Tools helper being importable
    # where the spider is defined.

    # _rules is now a dict keyed by domain instead of a flat list of rules.
    _rules = {}

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()

        # Look up the rules that belong to the domain of the current response.
        domain = Tools.get_domain(response.url)
        for n, rule in enumerate(self._rules[domain]):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                # Encode both the domain and the rule index in the request meta,
                # so _response_downloaded can find the right rule again.
                r = self._build_request(domain + ';' + str(n), link)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        # Split the 'domain;rule_index' value stored by _requests_to_follow.
        meta_rule = response.meta['rule'].split(';')
        domain = meta_rule[0]
        rule_n = int(meta_rule[1])

        rule = self._rules[domain][rule_n]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        # Build one set of rules per start URL, keyed by that URL's domain.
        for url in self.start_urls:
            url_rules = self.rules_for_url(url)
            domain = Tools.get_domain(url)
            self._rules[domain] = [copy.copy(r) for r in url_rules]
            for rule in self._rules[domain]:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)
    

    See the original implementations of these functions in Scrapy's CrawlSpider source code.

    Now the spider simply goes over each url in start_urls, creates a set of rules specific to that url, and then uses the appropriate rules for each website being crawled.
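
    For reference, a spider wiring these pieces together might look roughly like the sketch below (the class name, start_urls and callback bodies are just placeholders; rules_for_url and the overridden CrawlSpider methods are the ones defined above):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class PerDomainSpider(CrawlSpider):
        name = 'per_domain'
        start_urls = [
            'http://www.example.com',
            'http://www.example.org',
        ]

        # rules_for_url, _rules, _requests_to_follow, _response_downloaded
        # and _compile_rules are defined here exactly as shown above.

        def parse_internal(self, response):
            # Internal pages: the rules already follow their links;
            # add any per-site scraping logic here.
            pass

        def parse_external(self, response):
            # External pages: scrape whatever you need; follow=False stops the crawl here.
            yield {'url': response.url, 'title': response.css('title::text').get()}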

    Hope this helps anyone who stumbles upon this problem in the future.

    Simon.
