Scrapy crawl all sitemap links

Front-end · Unresolved · 2 answers · 767 views

一个人的身影 2021-01-07 09:57

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap.

2 Answers
  •  栀梦 (OP)
     2021-01-07 10:33

    You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/ and you want to create a rule for it:

    from scrapy.spiders import SitemapSpider


    class MySpider(SitemapSpider):
        name = 'xyz'
        # sitemap_urls must be a list, even for a single sitemap
        sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
        # list of (regex, callback-name) tuples - this example contains one rule;
        # the callback is given by name so it can be referenced before definition
        sitemap_rules = [('/x/', 'parse_x')]

        def parse_x(self, response):
            # responses expose .xpath() directly; no separate Selector needed
            paragraphs = response.xpath('//p').extract()
            yield {'paragraphs': paragraphs}

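    Each rule's first element is a regular expression that SitemapSpider matches against every URL found in the sitemap; the first matching rule decides which callback handles that URL. A minimal sketch of that dispatch logic, using plain `re` (the second `/y/` rule and the `pick_callback` helper are hypothetical, for illustration only):

    ```python
    import re

    # Hypothetical rules mirroring the (pattern, callback-name) tuples above
    sitemap_rules = [('/x/', 'parse_x'), ('/y/', 'parse_y')]

    def pick_callback(url):
        """Return the callback name of the first rule whose regex matches
        the URL, mimicking how SitemapSpider dispatches sitemap entries."""
        for pattern, callback in sitemap_rules:
            if re.search(pattern, url):
                return callback
        return None  # URLs matching no rule are skipped

    print(pick_callback('http://www.xyz.nl/x/some-page'))  # parse_x
    print(pick_callback('http://www.xyz.nl/about'))        # None
    ```

    Because rules are checked in order, put the more specific patterns first.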
