Scrapy crawl all sitemap links

一个人的身影 2021-01-07 09:57

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl each of those links.

2 Answers
  • 2021-01-07 10:24

    Essentially, you can create new Request objects for the URLs discovered by the SitemapSpider and parse the responses with a new callback:

    from scrapy import Request
    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            print(response.url)
            # Re-request the same URL with a different callback;
            # dont_filter=True is needed, otherwise the duplicate-request
            # filter would drop this second request for the same URL.
            return Request(response.url, callback=self.parse_sitemap_url,
                           dont_filter=True)

        def parse_sitemap_url(self, response):
            # do stuff with your sitemap links
            pass
    
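    Note that SitemapSpider has already downloaded each page by the time parse() runs (its default sitemap_rules send every sitemap URL to parse), so as a design alternative you can skip the second Request and scrape the response directly. A minimal sketch, assuming you just want every link on each page (the item keys here are made up for illustration):

        def parse(self, response):
            # the response is already a page listed in the sitemap,
            # so it can be scraped directly without re-requesting
            yield {
                "url": response.url,
                "links": response.xpath("//a/@href").extract(),
            }
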
  • 2021-01-07 10:33

    You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you want. For instance, say you have a page at http://www.xyz.nl/x/ and you want to create a rule for it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = 'xyz'
        sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
        # list of (regex, callback) tuples - this example contains one rule;
        # the callback is given as a string because parse_x is not yet
        # defined when the class body is evaluated
        sitemap_rules = [('/x/', 'parse_x')]

        def parse_x(self, response):
            # extract every <p> element from the matched page
            paragraphs = response.xpath('//p').extract()
            # return a dict so Scrapy treats the result as an item
            return {'paragraphs': paragraphs}
    
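    Since each rule is just a (regex, callback) pair, "as many as you want" could look like the sketch below, with hypothetical paths /x/ and /y/. Scrapy applies the rules in order and only the first regex that matches a URL is used, so a catch-all pattern should come last:

        sitemap_rules = [
            ('/x/', 'parse_x'),     # urls containing /x/
            ('/y/', 'parse_y'),     # urls containing /y/
            ('', 'parse_other'),    # empty pattern matches everything else
        ]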