Write BigQuery results to GCS in CSV format using Apache Beam

盖世英雄少女心 2021-01-07 04:56

I am fairly new to Apache Beam and am trying to write a pipeline that extracts data from Google BigQuery and writes it to GCS in CSV format using Pyth

1 answer
  • 2021-01-07 05:13

    You can do so using WriteToText, which lets you add a .csv suffix and a header line. Keep in mind that you need to convert the query results to CSV format yourself. As an example, I used the Shakespeare public dataset and the following query:

    SELECT word, word_count, corpus FROM `bigquery-public-data.samples.shakespeare` WHERE CHAR_LENGTH(word) > 3 ORDER BY word_count DESC LIMIT 10

    We now read the query results with:

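    # BigQuerySource is deprecated in newer Beam releases; there,
    # beam.io.ReadFromBigQuery(query=query, use_standard_sql=True) replaces it.
    BQ_DATA = p | 'read_bq_view' >> beam.io.Read(
        beam.io.BigQuerySource(query=query, use_standard_sql=True))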
    

    Each element of BQ_DATA is now a dictionary that maps column names to values:

    {u'corpus': u'hamlet', u'word': u'HAMLET', u'word_count': 407}
    {u'corpus': u'kingrichardiii', u'word': u'that', u'word_count': 319}
    {u'corpus': u'othello', u'word': u'OTHELLO', u'word_count': 313}
    

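    We can apply a beam.Map function to keep only the values. Their order follows the dictionary's own ordering, which is not guaranteed to match the SELECT list, so check it against the header you write later. Wrapping the result in list() keeps each element a plain list on Python 3, where values() returns a view object:

    BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: list(x.values()))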
    

    Excerpt of BQ_VALUES:

    [u'hamlet', u'HAMLET', 407]
    [u'kingrichardiii', u'that', 319]
    [u'othello', u'OTHELLO', 313]
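
Because the order returned by values() depends on how the row dictionary was built, one option is to pick the columns by name instead. The following is a minimal sketch in plain Python; COLUMNS and row_to_values are hypothetical names introduced here for illustration and are not part of Apache Beam. The function could be passed to beam.Map in place of the lambda above.

```python
# Hypothetical helper, not part of Apache Beam: select columns by name so the
# output order never depends on dictionary ordering.
COLUMNS = ['word', 'word_count', 'corpus']  # assumed to match the SELECT list


def row_to_values(row, columns=COLUMNS):
    """Return the row's values as a list, in the order given by `columns`."""
    return [row[name] for name in columns]


if __name__ == '__main__':
    sample = {'corpus': 'hamlet', 'word': 'HAMLET', 'word_count': 407}
    print(row_to_values(sample))
```

Building the header from the same list, for example ', '.join(COLUMNS), would keep the header and the data columns in the same order.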
    

    And finally map again to join the column values of each row into a single comma-separated string instead of a list (keep in mind that you would need to escape double quotes if they can appear within a field):

    BQ_CSV = BQ_VALUES | 'CSV format' >> beam.Map(
        lambda row: ', '.join(['"'+ str(column) +'"' for column in row]))
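
If fields may contain double quotes or commas, the escaping can be delegated to Python's standard csv module rather than done by hand. This is a sketch under that assumption, not the method used in the answer; to_csv_line is a hypothetical helper name, and it could be passed to beam.Map in place of the lambda above.

```python
import csv
import io


def to_csv_line(row):
    """Format one row (an iterable of values) as a single CSV line.

    csv.writer with QUOTE_ALL quotes every field and doubles any embedded
    double quotes, so callers do not have to escape them by hand.
    """
    buffer = io.StringIO()
    # An empty line terminator is used because WriteToText adds the newline.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator='')
    writer.writerow(row)
    return buffer.getvalue()


if __name__ == '__main__':
    print(to_csv_line(['hamlet', 'HAMLET', 407]))
    print(to_csv_line(['say "hi"', 1]))
```

Unlike the lambda above, this variant separates fields with a bare comma (no trailing space), which is the form most CSV parsers expect when fields are quoted.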
    

    Now we write the results to GCS with the .csv suffix and a header line:

    BQ_CSV | 'Write_to_GCS' >> beam.io.WriteToText(
        'gs://{0}/results/output'.format(BUCKET), file_name_suffix='.csv', header='word, word count, corpus')
    

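    Written results (note that the values come out in the order corpus, word, word count, so the header passed to WriteToText should list the columns in that order):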

    $ gsutil cat gs://$BUCKET/results/output-00000-of-00001.csv
    word, word count, corpus
    "hamlet", "HAMLET", "407"
    "kingrichardiii", "that", "319"
    "othello", "OTHELLO", "313"
    "merrywivesofwindsor", "MISTRESS", "310"
    "othello", "IAGO", "299"
    "antonyandcleopatra", "ANTONY", "284"
    "asyoulikeit", "that", "281"
    "antonyandcleopatra", "CLEOPATRA", "274"
    "measureforemeasure", "your", "274"
    "romeoandjuliet", "that", "270"
    
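
Because the fields in this output are separated by a comma followed by a space, a reader built on Python's csv module needs skipinitialspace=True to treat the quotes as quoting rather than as literal characters. A small sketch, using two lines copied from the output above; parse_output is a hypothetical helper name.

```python
import csv
import io

# Two lines copied from the output above; in practice the text would come
# from the downloaded file.
SAMPLE = 'word, word count, corpus\n"hamlet", "HAMLET", "407"\n'


def parse_output(text):
    """Parse the comma-plus-space format into a list of field lists."""
    # skipinitialspace=True drops the space after each comma, so the opening
    # quote is seen at the start of the field and handled as quoting.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


if __name__ == '__main__':
    for fields in parse_output(SAMPLE):
        print(fields)
```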