Dynamic rules based on start_urls for Scrapy CrawlSpider?

深忆病人 2021-01-07 17:17

I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original site's).

2 Answers
  •  不知归路
    2021-01-07 17:52

    Iterate over all the website links in start_urls to populate the allow_domains and deny_domains lists, and then define the Rules:

    start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
    
    allow_domains = []
    deny_domains = []
    
    for link in start_urls
    
        # strip http and www
        domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
        domain = domain[:-1] if domain[-1] == '/' else domain
    
        allow_domains.extend([domain])
        deny_domains.extend([domain])
    
    
    rules = (
        Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
    )
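
    For completeness, here is a minimal spider sketch that puts the pieces together. The class name DynamicRulesSpider, the _to_domain helper, and the fields yielded by the callbacks are illustrative assumptions, not part of the original answer; only the two-rule allow/deny idea comes from the snippet above.

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    def _to_domain(url):
        # hypothetical helper: reduce a start URL to its bare domain
        return url.replace('http://', '').replace('https://', '').replace('www.', '').rstrip('/')


    class DynamicRulesSpider(CrawlSpider):
        # illustrative name and URLs
        name = 'dynamic_rules'
        start_urls = ['http://www.website1.com', 'http://www.website2.com']

        _domains = [_to_domain(u) for u in start_urls]

        rules = (
            # stay on the start sites and keep crawling their internal links
            Rule(LinkExtractor(allow_domains=_domains), callback='parse_internal', follow=True),
            # anything outside those domains is external: scrape it, do not follow
            Rule(LinkExtractor(deny_domains=_domains), callback='parse_external', follow=False),
        )

        def parse_internal(self, response):
            # assumed example output: record the internal page that was visited
            yield {'type': 'internal', 'url': response.url}

        def parse_external(self, response):
            # assumed example output: record the external page and its title
            yield {'type': 'external', 'url': response.url,
                   'title': response.css('title::text').get()}

    Running scrapy crawl dynamic_rules -o items.json would then collect both the internal and external items into a single output file.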
    
