Order a JSON by field using Scrapy

后悔当初 2021-01-26 02:22

I have created a spider to scrape problems from projecteuler.net. Here I have concluded my answer to a related question with

I launch this with the comma

2 Answers
  • 2021-01-26 02:52

    I've since found a working solution using an item pipeline:

    import json


    class JsonWriterPipeline:
        """Buffer every item, then write them to euler.json ordered by 'id'."""

        def open_spider(self, spider):
            self.list_items = []
            self.file = open('euler.json', 'w')

        def close_spider(self, spider):
            # Items arrive in arbitrary order, so sort them by numeric id first.
            # int() also handles ids that were scraped as strings.
            ordered_list = sorted(self.list_items, key=lambda item: int(item['id']))

            # Join with ",\n" so there is no trailing comma (invalid JSON).
            body = ",\n".join(json.dumps(dict(item)) for item in ordered_list)
            self.file.write("[\n" + body + "\n]\n")
            self.file.close()

        def process_item(self, item, spider):
            self.list_items.append(item)
            return item
    

    This pipeline may not be the optimal approach, though, because the Scrapy guide says of a similar example:

    The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

  • 2021-01-26 03:00

    If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.

    Scrapy's built-in JsonItemExporter (in scrapy.exporters) shows how such an exporter is implemented.
    With a few simple changes, you can modify it to add the items to a list in export_item(), and then sort the items and write out the file in finish_exporting().

    Since you're only scraping a few hundred items, the downsides of holding them all in a list and not writing to the file until the crawl is done shouldn't be a problem for you.
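
    The approach described in this answer can be sketched roughly as follows. This is a minimal standalone illustration, not Scrapy's own code: the class name SortedJsonExporter and the sort_key parameter are hypothetical, and the class only mirrors the three exporter method names (start_exporting, export_item, finish_exporting) so that it runs without Scrapy installed. In a real project it would presumably subclass scrapy.exporters.JsonItemExporter and be registered through the FEED_EXPORTERS setting.

```python
import io
import json


class SortedJsonExporter:
    """Buffers items and writes them as one JSON array sorted by a field.

    Standalone sketch: it mirrors the method names of Scrapy's exporters
    but does not import Scrapy. Class name and sort_key are hypothetical.
    """

    def __init__(self, file, sort_key="id"):
        self.file = file          # binary file-like object
        self.sort_key = sort_key  # field to order the output by
        self.items = []

    def start_exporting(self):
        # Reset the buffer at the start of an export run.
        self.items = []

    def export_item(self, item):
        # Only buffer here; nothing is written until the crawl finishes.
        self.items.append(dict(item))

    def finish_exporting(self):
        # int() so that ids scraped as strings ("10") sort numerically.
        self.items.sort(key=lambda it: int(it[self.sort_key]))
        body = ",\n".join(json.dumps(it) for it in self.items)
        self.file.write(("[\n" + body + "\n]\n").encode("utf-8"))


def demo():
    """Export three out-of-order items and return the parsed result."""
    buffer = io.BytesIO()
    exporter = SortedJsonExporter(buffer)
    exporter.start_exporting()
    for item in (
        {"id": "10", "title": "c"},
        {"id": 2, "title": "b"},
        {"id": 1, "title": "a"},
    ):
        exporter.export_item(item)
    exporter.finish_exporting()
    return json.loads(buffer.getvalue().decode("utf-8"))
```

    Calling demo() returns the three items ordered by id regardless of the order in which they were exported. The trade-off is the one noted above: every item stays in memory until finish_exporting() runs.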
