I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original site's domain).
Iterate over all website links in start_urls and populate the allow_domains and deny_domains lists, then define the Rules:
start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
allow_domains = []
deny_domains = []
for link in start_urls
# strip http and www
domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
domain = domain[:-1] if domain[-1] == '/' else domain
allow_domains.extend([domain])
deny_domains.extend([domain])
rules = (
Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it isn't supported out of the box in Scrapy.
I've created a function that takes a URL as input and creates the rules for it:
def rules_for_url(self, url):
    domain = Tools.get_domain(url)
    rules = (
        # internal links of this domain: follow and parse
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
        # anything outside this domain: parse once, don't follow
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
    )
    return rules
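Tools.get_domain is just a small helper and isn't shown above; a minimal sketch of what it could look like, assuming it only needs to strip the scheme and the www prefix:

from urllib.parse import urlsplit

class Tools:
    @staticmethod
    def get_domain(url):
        # urlsplit only fills in netloc when the URL has a scheme
        if '://' not in url:
            url = 'http://' + url
        netloc = urlsplit(url).netloc
        # drop a leading 'www.' so both www and bare hosts map to the same domain
        return netloc[4:] if netloc.startswith('www.') else netloc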
I then override some of CrawlSpider's functions. I changed _rules into a dictionary where the keys are the different website domains and the values are the rules for that domain (built with rules_for_url). The population of _rules is done in _compile_rules. I then made the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.
# module-level imports needed by the overrides below
# (the overrides themselves live inside the spider class)
import copy

import six
from scrapy.http import HtmlResponse

_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    # use the rule set that belongs to the domain of the current response
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            # store both the domain and the rule index in the request meta
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    # recover the domain and rule index saved by _requests_to_follow
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])

    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    # build a separate rule set for every start URL, keyed by its domain
    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
See the original functions here.
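The parse_internal and parse_external callbacks that the rules point to aren't shown above, since what they extract depends on your use case; a minimal sketch with placeholder fields could look like this:

def parse_internal(self, response):
    # placeholder: collect something from a page on the crawled site itself
    yield {
        'kind': 'internal',
        'url': response.url,
        'title': response.css('title::text').get(),
    }

def parse_external(self, response):
    # placeholder: scrape the contents of the externally linked page
    yield {
        'kind': 'external',
        'url': response.url,
        'title': response.css('title::text').get(),
    }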
Now the spider will simply go over each URL in start_urls, create a set of rules specific to that URL, and then use the appropriate rules for each website being crawled.
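For completeness, this is roughly how the pieces could be wired together and run; the class name, spider name, and the use of CrawlerProcess here are just an example, not the only way to do it:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider


class MultiSiteSpider(CrawlSpider):
    name = 'multi_site'
    start_urls = ["http://www.website1.com", "http://www.website2.com"]

    # rules_for_url, _compile_rules, _requests_to_follow, _response_downloaded,
    # parse_internal and parse_external from the snippets above go here


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(MultiSiteSpider)
    process.start()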
Hope this helps anyone who stumbles upon this problem in the future.
Simon.