I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the site's own domain).
The idea is to iterate over all the website links in start_urls, populate the allow_domains and deny_domains arrays, and then define the Rules.
start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
allow_domains = []
deny_domains = []
for link in start_urls
# strip http and www
domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
domain = domain[:-1] if domain[-1] == '/' else domain
allow_domains.extend([domain])
deny_domains.extend([domain])
rules = (
    # internal links: allow the sites' own domains, parse them and keep following
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    # external links: deny the sites' own domains, parse them but don't follow further
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
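
To make this concrete, here is a rough sketch of the full spider as I picture it. The class name, spider name, strip_domain helper, and the callback bodies are just placeholders, and I added the http:// scheme to start_urls since Scrapy needs full URLs:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def strip_domain(url):
    # "http://www.website1.com/" -> "website1.com"
    return url.replace('http://', '').replace('https://', '').replace('www.', '').rstrip('/')


START_URLS = ["http://www.website1.com", "http://www.website2.com",
              "http://www.website3.com", "http://www.website4.com"]
DOMAINS = [strip_domain(url) for url in START_URLS]


class ExternalLinkSpider(CrawlSpider):
    name = 'external_links'
    start_urls = START_URLS
    # allowed_domains is deliberately NOT set, otherwise the offsite
    # middleware would filter out the requests to external domains

    rules = (
        # internal links: stay on the sites themselves and keep crawling
        Rule(LinkExtractor(allow_domains=DOMAINS, deny_domains=()), callback='parse_internal', follow=True),
        # external links: scrape once, don't follow any further
        Rule(LinkExtractor(allow_domains=(), deny_domains=DOMAINS), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # placeholder: just record the internal page
        yield {'type': 'internal', 'url': response.url, 'title': response.css('title::text').get()}

    def parse_external(self, response):
        # placeholder: record the external page
        yield {'type': 'external', 'url': response.url, 'title': response.css('title::text').get()}

This can be run as a standalone file with scrapy runspider spider.py -o output.json.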