How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?

死守一世寂寞 2020-12-05 00:42

I have made a simple Scrapy spider that I use from the command line to export my data into CSV format, but the order of the columns in the output seems random. How can I control the column order in the CSV file?

2 Answers
  • 2020-12-05 01:23

    To use such an exporter, you need to create your own item pipeline that processes your spider's output. Assuming the simple case where you want all of the spider's output in a single file, this is the pipeline to use (pipelines.py):

    from scrapy import signals
    from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter in older Scrapy versions
    
    class CSVPipeline(object):
    
      def __init__(self):
        self.files = {}
    
      @classmethod
      def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline
    
      def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # list your field names here; the order determines the CSV column order
        self.exporter.fields_to_export = ['field1', 'field2', 'field3']  # replace with your own field names
        self.exporter.start_exporting()
    
      def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
    
      def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
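
    The names in fields_to_export must match the field names declared on your item. For illustration only, assuming a hypothetical item like the one below (items.py), fields_to_export would be e.g. ['name', 'price', 'url'], in whatever column order you want:

    import scrapy

    class ProductItem(scrapy.Item):
      # hypothetical item for illustration; use your own item's field names
      name = scrapy.Field()
      price = scrapy.Field()
      url = scrapy.Field()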
    

    Of course, you need to remember to add this pipeline to your configuration file (settings.py):

    ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
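
    With the pipeline enabled, run the spider as usual; no -o option is needed, since the pipeline writes the file itself. For a spider named, say, myspider, the pipeline above produces myspider_items.csv with the columns in the order you listed:

    scrapy crawl myspider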
    
  • 2020-12-05 01:23

    You can now specify settings in the spider itself. https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider

    To set the field order for exported feeds, set FEED_EXPORT_FIELDS. https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
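
    If you want the same column order for every spider in the project, FEED_EXPORT_FIELDS can also be set globally instead of per spider; a minimal sketch (settings.py):

    # settings.py
    FEED_EXPORT_FIELDS = ["page", "page_ix", "text", "url"]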

    The spider below dumps all links on a website (written against Scrapy 1.4.0):

    import scrapy
    from scrapy.http import HtmlResponse
    
    class DumplinksSpider(scrapy.Spider):
      name = 'dumplinks'
      allowed_domains = ['www.example.com']
      start_urls = ['http://www.example.com/']
      custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
      }
    
      def parse(self, response):
        if not isinstance(response, HtmlResponse):
          return
    
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
          text = a_selector.xpath('normalize-space(text())').extract_first()
          url = a_selector.xpath('@href').extract_first()
          if url is None:
            continue  # skip anchors without an href
          yield {
            'page_ix': i + 1,
            'page': response.url,
            'text': text,
            'url': url,
          }
          yield response.follow(url, callback=self.parse)  # offsite links are filtered via allowed_domains
    

    Run with this command:

    scrapy crawl dumplinks --loglevel=INFO -o links.csv
    

    Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
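
    In particular, the header row of links.csv comes out in that same order:

    page,page_ix,text,url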
