Dataflow Python SDK Avro Source/Sink

Submitted by 有些话、适合烂在心里 on 2021-01-29 03:00:28

Question


I want to read and write Avro files in GCS with the Python SDK. Is this currently possible? If so, how would I do it? I see TODO comments in the source regarding this, so I am not too optimistic.


Answer 1:


You are correct: the Python SDK does not yet support this, but it will soon.




Answer 2:


As of version 2.6.0 of the Apache Beam/Dataflow Python SDK, it is indeed possible to read and write Avro files in GCS.

Even better, the Python SDK for Beam now supports fastavro reads and writes, which can be up to 10x faster than regular Avro IO.

Sample code:

import apache_beam as beam
from apache_beam.io import ReadFromAvro
from apache_beam.io import WriteToAvro
import avro.schema


RUNNER = 'DataflowRunner'
GCP_PROJECT_ID = 'YOUR_PROJECT_ID'
BUCKET_NAME = 'YOUR_BUCKET_HERE'
STAGING_LOCATION = 'gs://{}/staging'.format(BUCKET_NAME)
TEMP_LOCATION = 'gs://{}/temp'.format(BUCKET_NAME)
GCS_INPUT = "gs://{}/input-*.avro".format(BUCKET_NAME)
GCS_OUTPUT = "gs://{}/".format(BUCKET_NAME)
JOB_NAME = 'conversion-test'

SCHEMA_PATH="YOUR_AVRO_SCHEMA.avsc"
AVRO_SCHEMA = avro.schema.parse(open(SCHEMA_PATH).read())

OPTIONS = {
    'runner': RUNNER,
    'job_name': JOB_NAME,
    'staging_location': STAGING_LOCATION,
    'temp_location': TEMP_LOCATION,
    'project': GCP_PROJECT_ID,
    'max_num_workers': 2,
    'save_main_session': True,
}

PIPELINE = beam.Pipeline(options=beam.pipeline.PipelineOptions(flags=[], **OPTIONS))


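def main():
    # note: have to force `use_fastavro` to enable `fastavro`:
    results = PIPELINE | ReadFromAvro(file_pattern=GCS_INPUT, use_fastavro=True)
    results | WriteToAvro(file_path_prefix=GCS_OUTPUT, schema=AVRO_SCHEMA, use_fastavro=True)
    # The transforms above only build the pipeline graph; nothing executes until run().
    PIPELINE.run().wait_until_finish()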


if __name__ == '__main__':
    import os
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH_TO_YOUR_SERVICE_ACCOUNT_KEY'
    main()
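
The sample above reads its schema from YOUR_AVRO_SCHEMA.avsc but does not show what such a file contains. An .avsc file is plain JSON, so the following is a minimal sketch of producing and reloading one with the standard library only. The record name, namespace, and fields are hypothetical placeholders, and `write_schema` and `load_schema` are illustrative helpers, not part of Beam or Avro.

```python
import json
import os
import tempfile

# Hypothetical record layout; replace the name and fields with your own.
EXAMPLE_SCHEMA = {
    "type": "record",
    "name": "ExampleRecord",
    "namespace": "example.avro",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def write_schema(path, schema):
    """Serialize an Avro schema dict to a .avsc file (plain JSON)."""
    with open(path, "w") as handle:
        json.dump(schema, handle, indent=2)


def load_schema(path):
    """Read a .avsc file back into a Python dict."""
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    schema_path = os.path.join(tempfile.mkdtemp(), "YOUR_AVRO_SCHEMA.avsc")
    write_schema(schema_path, EXAMPLE_SCHEMA)
    print(load_schema(schema_path)["name"])
```

If the Beam version in use expects the schema as a plain dict rather than a parsed avro schema object, the dict returned by `load_schema` may be what you pass as `schema`; check the `WriteToAvro` documentation for your release.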


Source: https://stackoverflow.com/questions/37799065/dataflow-python-sdk-avro-source-sync
