How to convert CSV into a dictionary in Apache Beam Dataflow

独厮守ぢ 2020-12-05 21:14

I would like to read a CSV file and write it to BigQuery using Apache Beam on Dataflow. To do this, I need to present each row to BigQuery in the form of a dictionary.

2 answers
  • 2020-12-05 21:46

    Edit: as of version 2.12.0, Beam comes with new fileio transforms that allow you to read from CSV without having to reimplement a source. You can do this like so:

    import csv
    import io
    import sys

    import apache_beam as beam
    from apache_beam.io import fileio

    def get_csv_reader(readable_file):
      # You can return whichever kind of reader you want here:
      # a csv.DictReader, or a normal csv.reader.
      if sys.version_info >= (3, 0):
        # On Python 3, open() yields bytes, so wrap it in a text layer.
        return csv.reader(io.TextIOWrapper(readable_file.open()))
      else:
        return csv.reader(readable_file.open())

    with beam.Pipeline(...) as p:
      content_pc = (p
                    | fileio.MatchFiles("/my/file/name")
                    | fileio.ReadMatches()
                    | beam.Reshuffle()  # Useful if you expect many matches
                    | beam.FlatMap(get_csv_reader))
    

    I recently wrote a test for this for Apache Beam. You can take a look at it in the GitHub repository.
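
    Since the question asks for dictionaries, here is a minimal sketch of a reader function that returns a csv.DictReader instead of a plain csv.reader. The names get_csv_dict_reader and FakeReadableFile are hypothetical; the fake class only imitates the open() method of the objects that ReadMatches() emits so the function can be tried locally without a pipeline. UTF-8 encoding is an assumption about the input file.

```python
import csv
import io


def get_csv_dict_reader(readable_file):
    """Return an iterator of dicts, one per CSV row after the header.

    `readable_file` is assumed to expose open() returning a binary
    file-like object, as the files produced by ReadMatches() do.
    """
    # The csv module needs text on Python 3, so decode the byte stream.
    # UTF-8 is an assumption; adjust it to match the actual file.
    text_stream = io.TextIOWrapper(
        readable_file.open(), encoding="utf-8", newline="")
    return csv.DictReader(text_stream)


class FakeReadableFile:
    """Hypothetical stand-in for a Beam readable file (local testing only)."""

    def __init__(self, data):
        self._data = data

    def open(self):
        return io.BytesIO(self._data)


rows = list(get_csv_dict_reader(
    FakeReadableFile(b"name,age\nalice,30\nbob,25\n")))
print(rows)
```

    Passing a function like this to beam.FlatMap in place of get_csv_reader should yield one dictionary per data row, keyed by the header names.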


    The old answer, kept below, relied on reimplementing a source. This is no longer the main recommended approach :)

    The idea is to have a source that returns parsed CSV rows. You can do this by subclassing the FileBasedSource class to include CSV parsing. In particular, the read_records method would look something like this:

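    import csv
    import io

    from apache_beam.io.filebasedsource import FileBasedSource

    class MyCsvFileSource(FileBasedSource):
      def read_records(self, file_name, range_tracker):
        # open_file() returns a binary stream; on Python 3, csv needs text.
        self._file = io.TextIOWrapper(self.open_file(file_name))

        reader = csv.reader(self._file)

        for rec in reader:
          yield rec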
    
  • 2020-12-05 21:54

    As a supplement to Pablo's post, I'd like to share a little change I made myself to his sample. (+1 for you!)

    Changed: reader = csv.reader(self._file) to reader = csv.DictReader(self._file)

    csv.DictReader uses the first row of the CSV file as the dict keys. Each remaining row is turned into its own dict, with its values assigned to those keys by column position.
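
    To illustrate the header-to-key mapping, here is a small standalone sketch that uses only the standard library. The sample data and variable names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the first line is the header row.
sample = io.StringIO("id,city\n1,Paris\n2,Lima\n")

reader = csv.DictReader(sample)
header = reader.fieldnames  # keys taken from the first row
records = [dict(row) for row in reader]

print(header)
print(records)
```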

    One small detail: every value in the dict is stored as a string. This may conflict with your BigQuery schema if you use, e.g., INTEGER for some fields, so you need to cast the values to the proper types afterwards.
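
    One possible way to do the casting is a small function applied to each row dict before the write step. This is only a sketch: the column names and the CONVERTERS mapping are hypothetical and would need to mirror your actual BigQuery schema.

```python
# Hypothetical mapping from column name to a Python converter.
# Columns not listed here are left as strings.
CONVERTERS = {"id": int, "price": float}


def cast_row(row, converters=CONVERTERS):
    """Return a copy of `row` with the listed columns converted."""
    return {key: converters.get(key, str)(value)
            for key, value in row.items()}


casted = cast_row({"id": "7", "name": "widget", "price": "2.5"})
print(casted)
```

    In a pipeline this could be applied with beam.Map(cast_row) ahead of the BigQuery write. Note that empty or malformed values would raise a ValueError here, so real data may need extra handling.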
