Question
I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records.
For the purposes of development, I'd like to limit the number of items processed. From reading the Scrapy docs, I identified that I need to use the CloseSpider extension.
I have followed the guide on how to enable this; in my spider config I have the following:
CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}
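For completeness, the same limit can also be set per spider via Scrapy's custom_settings class attribute; a minimal sketch (the spider class below is only a placeholder, not my actual spider):

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider shown only to illustrate per-spider settings
    name = "example"

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 1,
    }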
However, my spider never terminates. I'm aware that the CONCURRENT_REQUESTS setting affects when the spider actually terminates (as it will carry on processing each concurrent request), but this is only set to the default of 16, and yet my spider continues to process all the items.
I've tried using the CLOSESPIDER_TIMEOUT setting instead, but similarly this has no effect.
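For reference, the settings in question look like this in settings.py (the values shown are illustrative):

CONCURRENT_REQUESTS = 16    # the default
CLOSESPIDER_TIMEOUT = 30    # seconds; illustrative value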
Here is some debug info, from when I run the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As can be seen, the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are being applied.
Any ideas why this is not working?
Answer 1:
I came up with a solution with the help of parik's answer, along with my own research. It does have some unexplained behaviour, though, which I will cover below (comments appreciated).
In my spider's myspider_spider.py file, I have (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem


class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = ["www.mysource.com"]
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Pass the project settings into the spider
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Stop the crawl once the configured item limit is reached
        if self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1

        id = node.xpath('id/text()').extract()
        title = node.xpath('title/text()').extract()

        item = MySpiderItem()
        item['id'] = id
        item['title'] = title
        return item
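As an aside, Scrapy already exposes the project settings on every spider as self.settings, so the from_crawler/__init__ overrides above aren't strictly necessary. A minimal sketch of parse_node along those lines (same class and item as above, reading the limit with settings.getint):

    def parse_node(self, response, node):
        # Read the limit from the spider's built-in settings attribute
        limit = self.settings.getint('CLOSESPIDER_ITEMCOUNT')
        if limit and self.item_count >= limit:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - %d' % limit)
        self.item_count += 1

        item = MySpiderItem()
        item['id'] = node.xpath('id/text()').extract()
        item['title'] = node.xpath('title/text()').extract()
        return item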
This works: if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items are processed (so, in that respect, it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
I commented this out in my settings.py:
#EXTENSIONS = {
# 'scrapy.extensions.closespider.CloseSpider': 500,
#}
So, it's simply using the CloseSpider exception. However, the log displays the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The key thing to highlight is the first INFO line and the finish_reason: the message displayed under INFO is not the one I set when raising the CloseSpider exception, which implies it's the CloseSpider extension that's stopping the spider, even though I didn't expect it to be active here. Very confusing. One likely explanation is that the CloseSpider extension is enabled by default in Scrapy, so having CLOSESPIDER_ITEMCOUNT in settings still activates it even with the explicit EXTENSIONS entry commented out, and it is the extension that records closespider_itemcount as the finish reason.
Answer 2:
You can also use the CloseSpider exception to limit the number of items. Just pay attention: the CloseSpider exception is only supported in spider callbacks, as you can see in the documentation:
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments: reason (str).
There are also some examples in the documentation.
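For instance, a minimal sketch (the spider name, URL, and limit below are placeholders for illustration only):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    # Hypothetical example spider - name, start_urls and limit are placeholders
    name = "limited_example"
    start_urls = ["https://example.com/feed.xml"]
    item_limit = 10
    item_count = 0

    def parse(self, response):
        for title in response.xpath('//item/title/text()').extract():
            if self.item_count >= self.item_limit:
                # Raising CloseSpider from a callback asks Scrapy to stop the crawl
                raise CloseSpider('item limit reached')
            self.item_count += 1
            yield {'title': title}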
Source: https://stackoverflow.com/questions/44566184/scrapy-spider-not-terminating-with-use-of-closespider-extension