How to open and process a CSV file stored in Google Cloud Storage using Python

谎友^ 2021-01-14 06:16

I am using the Google Cloud Storage Client Library.

I am trying to open and process a CSV file (that was already uploaded to a bucket) using code like:



        
2 answers
  • 2021-01-14 06:54

    Try this:

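    import csv
    from StringIO import StringIO  # Python 2 module

    import cloudstorage as gcs  # assumed: the App Engine GCS client library

    filename = '/<my_bucket>/data.csv'
    with gcs.open(filename, 'r') as gcs_file:
        # Read the whole object into memory, then let csv parse it.
        csv_reader = csv.reader(StringIO(gcs_file.read()), delimiter=',',
                                quotechar='"')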
    

    This isn't ideal, though, since it reads the entire file into memory before parsing. I've filed a feature request to have GCS files support iteration.
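
    For readers on Python 3, the same read-everything approach might look like the sketch below. It assumes the file contents arrive as UTF-8 bytes; `rows_from_bytes` and `sample` are hypothetical names used only for illustration, and the byte string stands in for whatever `gcs_file.read()` returns.

```python
import csv
import io

def rows_from_bytes(payload, encoding="utf-8"):
    """Decode a bytes payload and parse it as CSV, entirely in memory."""
    text = payload.decode(encoding)
    # newline="" leaves line endings untouched so csv can handle them itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return list(reader)

# Stand-in for the bytes returned by gcs_file.read().
sample = b'id,comment\r\n1,"hello, world"\r\n'
rows = rows_from_bytes(sample)
print(rows)
```

    Like the snippet above it, this holds the full file in memory, so it only suits files that comfortably fit.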

  • 2021-01-14 06:56

    I think it's better to write your own wrapper/iterator designed for csv.reader. If gcs_file were to support the iterator protocol, it is not clear what next() should return so that it suits every consumer.

    According to the csv.reader documentation, it will:

    Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable. If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
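
    To illustrate the quoted passage, here is a small sketch (Python 3 syntax) in which csv.reader consumes a plain list and then a generator of strings. `line_source` is a hypothetical stand-in for any custom iterator.

```python
import csv

# A list of strings works as input: each element is handed to the parser in turn.
from_list = list(csv.reader(['a,b,c\n', '1,2,3\n']))

def line_source():
    """Hypothetical generator standing in for any custom iterator of strings."""
    yield 'x,y\n'
    yield '"hello, world",42\n'

from_generator = list(csv.reader(line_source()))
print(from_list)
print(from_generator)
```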

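    The reader only needs an iterator that yields strings. Note, however, that it treats each string it receives as one line, so a wrapper that reads from the file a chunk at a time must still return complete lines. You can have a wrapper like this (not tested):

    class CsvIterator(object):
      def __init__(self, gcs_file, chunk_size):
         self.gcs_file = gcs_file
         self.chunk_size = chunk_size
         self.buffer = ''
      def __iter__(self):
         return self
      def next(self):
         # Refill the buffer until it holds a full line or the file is exhausted.
         while '\n' not in self.buffer:
            chunk = self.gcs_file.read(self.chunk_size)
            if not chunk:
               break
            self.buffer += chunk
         if not self.buffer:
            raise StopIteration()
         line, sep, self.buffer = self.buffer.partition('\n')
         return line + sep
      __next__ = next  # Python 3 name for the same method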
    

    The key is to read one chunk at a time so that a large file neither exhausts memory nor triggers a urlfetch timeout.

    Or, even simpler, use the built-in iter:

    csv.reader(iter(gcs_file.readline, ''))
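
    As a rough illustration of the two-argument form of iter, the sketch below uses io.StringIO as a stand-in for the opened GCS file (Python 3 syntax). `read_rows` and `fake_file` are hypothetical names, and the sketch assumes the file object's readline() returns an empty string at end of file.

```python
import csv
import io

def read_rows(file_obj):
    """Parse CSV rows lazily from any object with a readline() method."""
    # iter(callable, sentinel) keeps calling readline() until it returns the sentinel.
    # A text-mode file returns '' at end of file; a binary one would return b''.
    return csv.reader(iter(file_obj.readline, ''))

# io.StringIO stands in for the opened GCS file here.
fake_file = io.StringIO('name,qty\n"Widget, large",3\n')
rows = list(read_rows(fake_file))
print(rows)
```

    A blank line inside the file comes back from readline() as '\n' rather than '', so it does not end the iteration early.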
    