I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the site's own domain).
The idea is to iterate over all the website links in start_urls, populate the allow_domains and deny_domains arrays, and then define the Rules.
start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
allow_domains = []
deny_domains = []
for link in start_urls
# strip http and www
domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
domain = domain[:-1] if domain[-1] == '/' else domain
allow_domains.extend([domain])
deny_domains.extend([domain])
rules = (
    # internal links: allow the sites' own domains, parse them and keep following
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    # external links: deny the sites' own domains, parse them but don't follow further
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
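
To make this concrete, here is a rough sketch of the full spider as I picture it. The class name, spider name, strip_domain helper, and the callback bodies are just placeholders, and I added the http:// scheme to start_urls since Scrapy needs full URLs:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def strip_domain(url):
    # "http://www.website1.com/" -> "website1.com"
    return url.replace('http://', '').replace('https://', '').replace('www.', '').rstrip('/')


START_URLS = ["http://www.website1.com", "http://www.website2.com",
              "http://www.website3.com", "http://www.website4.com"]
DOMAINS = [strip_domain(url) for url in START_URLS]


class ExternalLinkSpider(CrawlSpider):
    name = 'external_links'
    start_urls = START_URLS
    # allowed_domains is deliberately NOT set, otherwise the offsite
    # middleware would filter out the requests to external domains

    rules = (
        # internal links: stay on the sites themselves and keep crawling
        Rule(LinkExtractor(allow_domains=DOMAINS, deny_domains=()), callback='parse_internal', follow=True),
        # external links: scrape once, don't follow any further
        Rule(LinkExtractor(allow_domains=(), deny_domains=DOMAINS), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # placeholder: just record the internal page
        yield {'type': 'internal', 'url': response.url, 'title': response.css('title::text').get()}

    def parse_external(self, response):
        # placeholder: record the external page
        yield {'type': 'external', 'url': response.url, 'title': response.css('title::text').get()}

This can be run as a standalone file with scrapy runspider spider.py -o output.json.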