Why don't my Scrapy CrawlSpider rules work?


I've managed to code a very simple crawler with Scrapy, with these given constraints:

  • Store all link info (e.g. anchor text, page title), hence the two callbacks
1 Answer

    Here's a scraper that works perfectly:

        from scrapy.contrib.spiders import CrawlSpider, Rule
        from scrapy.selector import HtmlXPathSelector
        from scrapy.http import Request
        from scrapySpider.items import SPage
        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

        class TestSpider4(CrawlSpider):
            name = "spiderSO"
            allowed_domains = ["cumulodata.com"]
            start_urls = ["http://www.cumulodata.com/"]

            extractor = SgmlLinkExtractor()

            rules = (
                # Follow every extracted link and scrape it with parse_links
                Rule(extractor, callback='parse_links', follow=True),
            )

            def parse_start_url(self, response):
                # Rule callbacks don't run for the start URL, so scrape its
                # links here; list() forces the lazy generator to execute
                list(self.parse_links(response))

            def parse_links(self, response):
                hxs = HtmlXPathSelector(response)
                links = hxs.select('//a')
                for link in links:
                    title = ''.join(link.select('./@title').extract())
                    url = ''.join(link.select('./@href').extract())
                    meta = {'title': title}
                    # Append '?1' (or '/?1' when the URL has no path) so the
                    # request isn't dropped by Scrapy's duplicate filter
                    cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url
                    yield Request(cleaned_url, callback=self.parse_page, meta=meta)

            def parse_page(self, response):
                hxs = HtmlXPathSelector(response)
                item = SPage()
                item['url'] = response.url
                item['title'] = response.meta['title']
                item['h1'] = hxs.select('//h1/text()').extract()
                return item
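
    The SPage item class isn't shown here; judging from the fields assigned in parse_page, a minimal sketch of scrapySpider/items.py would presumably look like this (the field names are simply inferred from the spider above):

        from scrapy.item import Item, Field

        class SPage(Item):
            # Fields inferred from the assignments in parse_page
            url = Field()
            title = Field()
            h1 = Field()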
    

    Changes:

    1. Implemented parse_start_url - Unfortunately, when you specify a callback for the first request, the rules are not executed. This is built into Scrapy, and we can only manage it with a workaround. So we do a list(self.parse_links(response)) inside this function. Why the list()? Because parse_links is a generator, and generators are lazy, so we need to consume it fully (see the generator sketch after this list).

    2. cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url - There are a couple of things going on here:

      a. We're adding '/?1' to the end of the URL - Since parse_links returns duplicate URLs, Scrapy filters them out. An easier way to avoid that is to pass dont_filter=True to Request(). However, all your pages are interlinked (back to the index from pageAA, etc.), and dont_filter here results in too many duplicate requests and items.

      b. if not '/' in url.partition('//')[2] - Again, this is because of the linking in your website. One of the internal links points to 'www.cumulodata.com' and another to 'www.cumulodata.com/'. Since we're explicitly adding a mechanism that allows duplicates, this was resulting in one extra item. Since we wanted the output to be exact, I implemented this hack (see the URL-cleaning sketch after this list, which normalizes both forms to the same URL).

    3. title = ''.join(link.select('./@title').extract()) - You don't want to return the node, but its data. Also, ''.join(list) is safer than list[0] when the list can be empty (see the short sketch after this list).
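
    To illustrate point 1, here's a tiny standalone sketch (nothing to do with Scrapy) of why the list() call is needed: calling a generator function only creates a generator object, and its body doesn't run until something consumes it:

        def gen():
            print("running")
            yield 1

        g = gen()   # prints nothing - the generator body hasn't started yet
        list(g)     # prints "running" - list() consumes the generator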
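
    And a quick sketch of what the cleaned_url expression in point 2 does (the example URLs are just illustrative): it appends '?1' so the URL differs from the one the rule already requested, and it normalizes the with- and without-trailing-slash forms of the index to the same thing:

        def clean(url):
            # url.partition('//')[2] is everything after the scheme's '//';
            # no '/' there means the URL has no path, so add '/?1',
            # otherwise just append '?1'
            return "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url

        print(clean("http://www.cumulodata.com"))         # http://www.cumulodata.com/?1
        print(clean("http://www.cumulodata.com/"))        # http://www.cumulodata.com/?1 as well
        print(clean("http://www.cumulodata.com/pageAA"))  # http://www.cumulodata.com/pageAA?1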
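
    And point 3 in one line: with an empty selection, ''.join() degrades gracefully where indexing would blow up:

        extracted = []                 # e.g. a link with no title attribute
        print(''.join(extracted))      # '' - just an empty string
        # print(extracted[0])          # would raise IndexError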

    Congrats on creating a test website that posed a curious problem - duplicates are both necessary and unwanted!
