Dynamic rules based on start_urls for Scrapy CrawlSpider?

深忆病人 2021-01-07 17:17

I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).

2 Answers
  • 2021-01-07 17:52

    Iterate over all the website links in start_urls, populate the allow_domains and deny_domains lists, and then define the Rules:

    # module-level imports needed for the rules below:
    #   from scrapy.linkextractors import LinkExtractor
    #   from scrapy.spiders import Rule

    start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
    
    allow_domains = []
    deny_domains = []
    
    for link in start_urls:
    
        # strip the scheme, the www. prefix and any trailing slash
        domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
        domain = domain[:-1] if domain[-1] == '/' else domain
    
        allow_domains.append(domain)
        deny_domains.append(domain)
    
    
    rules = (
        # internal links: domains from start_urls are allowed and followed
        Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
        # external links: domains from start_urls are denied, everything else is scraped but not followed
        Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
    )
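
    Putting it together, here is a minimal sketch of the spider class this snippet would live in. The class name, the use of full http:// URLs in start_urls, and the callback bodies are illustrative assumptions, not part of the answer:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class MultiSiteSpider(CrawlSpider):
        # hypothetical spider name; Scrapy needs a scheme on start_urls to request them
        name = 'multi_site'
        start_urls = ['http://www.website1.com', 'http://www.website2.com']

        # same stripping logic as above, applied once at class-definition time
        _domains = [u.replace('http://', '').replace('https://', '')
                     .replace('www.', '').rstrip('/') for u in start_urls]

        rules = (
            # follow and parse links that stay on any of the start sites
            Rule(LinkExtractor(allow_domains=_domains), callback='parse_internal', follow=True),
            # parse links that leave the start sites, but do not crawl further from them
            Rule(LinkExtractor(deny_domains=_domains), callback='parse_external', follow=False),
        )

        def parse_internal(self, response):
            yield {'type': 'internal', 'url': response.url}

        def parse_external(self, response):
            yield {'type': 'external', 'url': response.url}

    Note that this approach shares one global rule set across all of the start URLs, which is the limitation the answer below works around by compiling a separate rule set per start URL.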
    
  • 2021-01-07 18:02

    I've found a very similar question and used the second option presented in its accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in Scrapy.

    I've created a function that takes a URL as input and creates the rules for it:

    def rules_for_url(self, url):
    
        # Tools.get_domain is the author's own helper that extracts a URL's domain
        domain = Tools.get_domain(url)
    
        rules = (
            # internal links for this particular site: follow and parse
            Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
            # links leaving this site's domain: scrape once, do not follow
            Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
        )
    
        return rules
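
    Tools.get_domain isn't shown in the answer; a minimal sketch of what such a helper could look like, assuming it should return the host without its www. prefix:

    from urllib.parse import urlparse


    class Tools:
        """Hypothetical helper; the answer does not show its implementation."""

        @staticmethod
        def get_domain(url):
            # fall back to the raw string for scheme-less URLs like 'www.website1.com'
            netloc = urlparse(url).netloc or url.split('/')[0]
            # drop a leading 'www.' so 'http://www.website1.com/page' -> 'website1.com'
            return netloc[4:] if netloc.startswith('www.') else netloc

    Under this sketch, Tools.get_domain('http://www.website1.com/page') returns 'website1.com', which is also what ends up as a key in the _rules dictionary described below.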
    

    I then override some of CrawlSpider's functions.

    1. I changed _rules into a dictionary whose keys are the different website domains and whose values are the rules for that domain (built with rules_for_url). _rules is populated in _compile_rules.

    2. I then made the matching changes in _requests_to_follow and _response_downloaded so they work with the new structure of _rules.

    # module-level imports needed by the overrides below:
    #   import copy
    #   import six
    #   from scrapy.http import HtmlResponse

    # keys are website domains, values are the compiled rules for that domain
    _rules = {}
    
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
    
        # use only the rules that belong to the site this response came from
        domain = Tools.get_domain(response.url)
        for n, rule in enumerate(self._rules[domain]):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                # store both the domain and the rule index in the request meta
                r = self._build_request(domain + ';' + str(n), link)
                yield rule.process_request(r)
    
    def _response_downloaded(self, response):
    
        # decode the 'domain;rule_index' string set in _requests_to_follow
        meta_rule = response.meta['rule'].split(';')
        domain = meta_rule[0]
        rule_n = int(meta_rule[1])
    
        rule = self._rules[domain][rule_n]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
    
    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)
    
        # build a separate rule set for every start URL, keyed by its domain
        for url in self.start_urls:
            url_rules = self.rules_for_url(url)
            domain = Tools.get_domain(url)
            self._rules[domain] = [copy.copy(r) for r in url_rules]
            for rule in self._rules[domain]:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)
    

    See the original functions in Scrapy's CrawlSpider source.

    Now the spider simply goes over each URL in start_urls, creates a set of rules specific to that URL, and then uses the appropriate rules for each website being crawled.
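
    To make the class context explicit, here is a hedged skeleton of the spider these overrides would sit in. The class name, start_urls and callback bodies are assumptions; rules_for_url and the three overridden methods are the ones shown above:

    from scrapy.spiders import CrawlSpider


    class PerSiteRulesSpider(CrawlSpider):
        # hypothetical name and start URLs
        name = 'per_site_rules'
        start_urls = ['http://www.website1.com', 'http://www.website2.com']

        # no static `rules` attribute is needed; the overridden _compile_rules fills _rules instead
        _rules = {}

        # rules_for_url, _requests_to_follow, _response_downloaded and
        # _compile_rules from above go here

        def parse_internal(self, response):
            yield {'type': 'internal', 'url': response.url}

        def parse_external(self, response):
            yield {'type': 'external', 'url': response.url}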

    Hope this helps anyone who stumbles upon this problem in the future.

    Simon.
