I am developing a simple scraper to get 9GAG posts and their images, but due to some technical difficulties I am unable to stop the scraper, and it keeps on scraping.
count is local to the parse() method, so it's not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.
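A minimal sketch of the difference (illustrative, not the asker's exact code):

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"
    count = 0  # initialized once on the class

    def parse(self, response):
        # self.count persists between parse() calls;
        # a plain local `count = 0` here would reset on every page
        self.count += 1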
First, use self.count and initialize it outside of parse. Then don't prevent the parsing of the items; only stop generating new requests once the limit is reached. See the following code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0  # initialized on the class, used as self.count so it persists between pages

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # items on the current page are always parsed; only the next request is gated
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
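For reference, the spider can then be run and its items collected with the usual command (the output file name here is just an example):

scrapy crawl first -o gags.json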
One can also use custom_settings with CLOSESPIDER_PAGECOUNT, as shown below.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': COUNT_MAX
    }

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            yield ninegag_item

        next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
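CLOSESPIDER_PAGECOUNT is handled by Scrapy's built-in CloseSpider extension: once that many responses have been crawled, the spider is closed gracefully. Requests already in flight are still allowed to finish, so the final page count can slightly overshoot the limit.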
Spider arguments are passed through the crawl command using the -a option; see the Scrapy documentation on spider arguments.
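A short sketch of reading such an argument, assuming a max_pages argument name (illustrative, not from the answers above):

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"

    def __init__(self, max_pages=30, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.max_pages = int(max_pages)  # -a values arrive as strings, hence the cast

which would be invoked as scrapy crawl first -a max_pages=50.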
There's a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
One small caveat is that if you've enabled caching, it will count cached responses toward the page count as well.
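If that matters for a given run, the HTTP cache can be switched off on the same command line (HTTPCACHE_ENABLED is the standard Scrapy setting for it):

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100 -s HTTPCACHE_ENABLED=0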