Scrapy crawl all sitemap links

一个人的身影 2021-01-07 09:57

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl each of those links.

2 Answers
  • 2021-01-07 10:24

    Essentially, you can create new Request objects for the URLs discovered by the SitemapSpider and parse the responses with a new callback:

    from scrapy import Request
    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            print(response.url)
            # Re-request the same URL with a different callback;
            # dont_filter=True is needed, otherwise the duplicate-request
            # filter would drop this second request for the same URL.
            return Request(response.url, callback=self.parse_sitemap_url,
                           dont_filter=True)

        def parse_sitemap_url(self, response):
            # do stuff with your sitemap links
            pass
    
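    Note that SitemapSpider has already downloaded each page by the time parse() runs (its default sitemap_rules send every sitemap URL to parse), so as a design alternative you can skip the second Request and scrape the response directly. A minimal sketch, assuming you just want every link on each page (the item keys here are made up for illustration):

        def parse(self, response):
            # the response is already a page listed in the sitemap,
            # so it can be scraped directly without re-requesting
            yield {
                "url": response.url,
                "links": response.xpath("//a/@href").extract(),
            }
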
  • 2021-01-07 10:33

    You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you want. For instance, say you have a page at http://www.xyz.nl/x/ and you want to create a rule for it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = 'xyz'
        sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
        # list of (regex, callback) tuples - this example contains one rule;
        # the callback is given as a string because parse_x is not yet
        # defined when the class body is evaluated
        sitemap_rules = [('/x/', 'parse_x')]

        def parse_x(self, response):
            # extract every <p> element from the matched page
            paragraphs = response.xpath('//p').extract()
            # return a dict so Scrapy treats the result as an item
            return {'paragraphs': paragraphs}
    
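    Since each rule is just a (regex, callback) pair, "as many as you want" could look like the sketch below, with hypothetical paths /x/ and /y/. Scrapy applies the rules in order and only the first regex that matches a URL is used, so a catch-all pattern should come last:

        sitemap_rules = [
            ('/x/', 'parse_x'),     # urls containing /x/
            ('/y/', 'parse_y'),     # urls containing /y/
            ('', 'parse_other'),    # empty pattern matches everything else
        ]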