Question
I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records.
For the purposes of development, I'd like to limit the number of items processed. From reading the Scrapy docs, I identified that I need to use the CloseSpider extension.
I have followed the guide on how to enable this; in my spider config I have the following:
CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}
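For completeness, the same limit can also be set per spider via Scrapy's custom_settings class attribute; a minimal sketch (the spider class below is only a placeholder, not my actual spider):

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider shown only to illustrate per-spider settings
    name = "example"

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 1,
    }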
However, my spider never terminates. I'm aware that the CONCURRENT_REQUESTS setting affects when the spider actually terminates (as it will carry on processing each concurrent request), but this is only set to the default of 16, and yet my spider continues to process all the items.
I've tried using the CLOSESPIDER_TIMEOUT setting instead, but similarly this has no effect.
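For reference, the settings in question look like this in settings.py (the values shown are illustrative):

CONCURRENT_REQUESTS = 16    # the default
CLOSESPIDER_TIMEOUT = 30    # seconds; illustrative value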
Here is some debug info, from when I run the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As can be seen, the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are being applied.
Any ideas why this is not working?
Answer 1:
I came up with a solution with the help of parik's answer, along with my own research. It does have some unexplained behaviour, though, which I will cover below (comments appreciated).
In my spider's myspider_spider.py file, I have (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem


class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = ["www.mysource.com"]
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Pass the project settings into the spider
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Stop the crawl once the configured item limit is reached
        if self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1

        id = node.xpath('id/text()').extract()
        title = node.xpath('title/text()').extract()

        item = MySpiderItem()
        item['id'] = id
        item['title'] = title
        return item
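As an aside, Scrapy already exposes the project settings on every spider as self.settings, so the from_crawler/__init__ overrides above aren't strictly necessary. A minimal sketch of parse_node along those lines (same class and item as above, reading the limit with settings.getint):

    def parse_node(self, response, node):
        # Read the limit from the spider's built-in settings attribute
        limit = self.settings.getint('CLOSESPIDER_ITEMCOUNT')
        if limit and self.item_count >= limit:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - %d' % limit)
        self.item_count += 1

        item = MySpiderItem()
        item['id'] = node.xpath('id/text()').extract()
        item['title'] = node.xpath('title/text()').extract()
        return item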
This works: if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items are processed (so, in that respect, it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
I commented this out in my settings.py:
#EXTENSIONS = {
# 'scrapy.extensions.closespider.CloseSpider': 500,
#}
So, it's simply using the CloseSpider exception. However, the log displays the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The key thing to highlight is the first INFO line and the finish_reason: the message displayed under INFO is not the one I set when raising the CloseSpider exception, which implies it's the CloseSpider extension that's stopping the spider, even though I didn't expect it to be active here. Very confusing. One likely explanation is that the CloseSpider extension is enabled by default in Scrapy, so having CLOSESPIDER_ITEMCOUNT in settings still activates it even with the explicit EXTENSIONS entry commented out, and it is the extension that records closespider_itemcount as the finish reason.
Answer 2:
You can also use the CloseSpider exception to limit the number of items. Just pay attention: the CloseSpider exception is only supported in spider callbacks, as you can see in the documentation:
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments: reason (str).
There are also some examples in the documentation.
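For instance, a minimal sketch (the spider name, URL, and limit below are placeholders for illustration only):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    # Hypothetical example spider - name, start_urls and limit are placeholders
    name = "limited_example"
    start_urls = ["https://example.com/feed.xml"]
    item_limit = 10
    item_count = 0

    def parse(self, response):
        for title in response.xpath('//item/title/text()').extract():
            if self.item_count >= self.item_limit:
                # Raising CloseSpider from a callback asks Scrapy to stop the crawl
                raise CloseSpider('item limit reached')
            self.item_count += 1
            yield {'title': title}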
Source: https://stackoverflow.com/questions/44566184/scrapy-spider-not-terminating-with-use-of-closespider-extension