google dataflow read from spanner

灰色年华 2020-12-22 08:01

I am trying to read a table from a Google Cloud Spanner database and write it to a text file as a backup, using Google Dataflow with the Python SDK. I have written the follow

2 answers
  • 2020-12-22 08:23

    I have reworked my code following the suggestion to simply use a ParDo instead of the BoundedSource class. For reference, here is my solution; I am sure there are many ways to improve on it, and I would be happy to hear opinions. In particular, I am surprised that I have to create a dummy PColl when starting the pipeline. If I don't, I get the error

    AttributeError: 'PBegin' object has no attribute 'windowing'

    which I could not work around. The dummy PColl feels a bit like a hack.

    from __future__ import absolute_import
    
    import datetime as dt
    import logging
    
    import apache_beam as beam
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.options.pipeline_options import StandardOptions, SetupOptions
    from apache_beam.options.pipeline_options import GoogleCloudOptions
    from google.cloud.spanner.client import Client
    from google.cloud.spanner.keyset import KeySet
    
    BUCKET_URL = 'gs://my_bucket'
    OUTPUT = '%s/some_folder/' % BUCKET_URL
    PROJECT = 'my_project'
    INSTANCE_ID = 'my_instance'
    DATABASE_ID = 'my_database'
    JOB_NAME = 'my_jobname'
    
    class ReadTables(beam.DoFn):
        def __init__(self, project, instance, database):
            super(ReadTables, self).__init__()
            self._project = project
            self._instance = instance
            self._database = database
    
        def process(self, element):
            # get list of tables in the database
            table_names_row = Client(self._project).instance(self._instance).database(self._database).execute_sql('SELECT t.table_name FROM information_schema.tables AS t')
            for row in table_names_row:
                if row[0] in [u'COLUMNS', u'INDEXES', u'INDEX_COLUMNS', u'SCHEMATA', u'TABLES']:    # skip these
                    continue
                yield row[0]
    
    class ReadSpannerTable(beam.DoFn):
        def __init__(self, project, instance, database):
            super(ReadSpannerTable, self).__init__()
            self._project = project
            self._instance = instance
            self._database = database
    
        def process(self, element):
            # first read the columns present in the table
            table_fields = Client(self._project).instance(self._instance).database(self._database).execute_sql("SELECT t.column_name FROM information_schema.columns AS t WHERE t.table_name = '%s'" % element)
            columns = [x[0] for x in table_fields]
    
            # next, read the actual data in the table
            keyset = KeySet(all_=True)
            results_streamed_set = Client(self._project).instance(self._instance).database(self._database).read(table=element, columns=columns, keyset=keyset)
    
            for row in results_streamed_set:
                JSON_row = dict(zip(columns, row))    # map column names to this row's values
                yield (element, JSON_row)            # output pairs of (table_name, data)
    
    def run(argv=None):
      """Main entry point"""
      pipeline_options = PipelineOptions()
      pipeline_options.view_as(SetupOptions).save_main_session = True
      pipeline_options.view_as(SetupOptions).requirements_file = "requirements.txt"
      google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
      google_cloud_options.project = PROJECT
      google_cloud_options.job_name = JOB_NAME
      google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
      google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
    
      pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
      p = beam.Pipeline(options=pipeline_options)
    
      init   = p       | 'Begin pipeline'              >> beam.Create(["test"])                                                 # have to create a dummy transform to initialize the pipeline, surely there is a better way ?
      tables = init    | 'Get tables from Spanner'     >> beam.ParDo(ReadTables(PROJECT, INSTANCE_ID, DATABASE_ID))          # read the tables in the db
      rows = (tables   | 'Get rows from Spanner table' >> beam.ParDo(ReadSpannerTable(PROJECT, INSTANCE_ID, DATABASE_ID))    # for each table, read the entries
                       | 'Group by table'              >> beam.GroupByKey()
                       | 'Formatting'                  >> beam.Map(lambda kv: (kv[0], list(kv[1]))))                            # kv is (table_name, rows); have to force rows to a list here (DataflowRunner produces _UnwindowedValues)
    
      iso_datetime = dt.datetime.now().replace(microsecond=0).isoformat()
      rows             | 'Store in GCS'                >> WriteToText(file_path_prefix=OUTPUT + iso_datetime, file_name_suffix='')
    
      result = p.run()
      result.wait_until_finish()
    
    if __name__ == '__main__':
      logging.getLogger().setLevel(logging.INFO)
      run()
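
On the dummy PColl: a `ParDo` is not a root transform, so it needs an input PCollection, and applying it directly to the pipeline object (a `PBegin`) is what produces the `AttributeError` quoted above. Seeding the pipeline with a single element is a common way to start, not really a hack. The sketch below uses `beam.Impulse()`, which emits one empty element, in place of `beam.Create(["test"])`. The names `user_tables`, `build_pipeline` and `list_tables` are hypothetical; `list_tables` stands in for whatever function runs the `information_schema` query.

```python
# Schema tables that the answer's ReadTables DoFn skips.
SYSTEM_TABLES = frozenset(
    ['COLUMNS', 'INDEXES', 'INDEX_COLUMNS', 'SCHEMATA', 'TABLES'])


def user_tables(table_names, skip=SYSTEM_TABLES):
    """Yield the table names that are not in the skip set."""
    for name in table_names:
        if name not in skip:
            yield name


def build_pipeline(list_tables, options=None):
    """Sketch: start from a one-element seed and fan out to table names.

    `list_tables` is a hypothetical zero-argument callable returning every
    table name in the database (an assumption, not part of the answer).
    """
    # Imported lazily so the helper above can be used without Beam installed.
    import apache_beam as beam

    p = beam.Pipeline(options=options)
    tables = (p
              | 'Seed' >> beam.Impulse()  # root transform, emits one element
              | 'List tables' >> beam.FlatMap(
                  lambda _: user_tables(list_tables())))
    return p, tables
```

The seed element itself is ignored; it only gives the first `FlatMap` something to run on, which is the same role `"test"` plays in the pipeline above.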
    
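
One possible refinement for the backup use case: the first answer's pipeline hands `WriteToText` tuples of `(table_name, list_of_rows)`, so each output line is, as far as I can tell, the string form of a Python tuple, which is awkward to parse back. A minimal sketch of emitting one JSON document per row instead is shown below. The helper names `to_json_line` and `_to_jsonable` are hypothetical, and the handling of timestamps, dates and bytes is an assumption about which column types appear in the table.

```python
import base64
import datetime as dt
import json


def _to_jsonable(value):
    """Convert values that json.dumps cannot serialize on its own."""
    if isinstance(value, (dt.datetime, dt.date)):
        return value.isoformat()
    if isinstance(value, bytes):
        return base64.b64encode(value).decode('ascii')
    raise TypeError('Cannot serialize %r' % (value,))


def to_json_line(table_name, columns, row):
    """Return a single-line JSON document for one row of `table_name`."""
    record = {'table': table_name, 'row': dict(zip(columns, row))}
    return json.dumps(record, default=_to_jsonable, sort_keys=True)
```

With something like this, `ReadSpannerTable.process` could yield `to_json_line(element, columns, row)` directly, and the `GroupByKey` and list-forcing steps would no longer be needed just for writing; whether that suits the restore path is a design choice.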
  • 2020-12-22 08:42

    Google has since added support for exporting Cloud Spanner databases with Dataflow; you can choose the corresponding template when creating the Dataflow job.

    For more: https://cloud.google.com/blog/products/gcp/cloud-spanner-adds-import-export-functionality-to-ease-data-movement
