Efficiently write a Pandas dataframe to Google BigQuery

无人及你 2020-12-13 08:01

I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that

1 answer
  • 2020-12-13 08:40

    I compared alternatives 1 and 3 in Datalab using the following code:

    from datalab.context import Context
    import datalab.storage as storage
    import datalab.bigquery as bq
    import pandas as pd
    import time

    # DataFrame to write: n rows of [1, 2, 3].
    # Lists are used instead of sets so that the column order is well defined.
    n = 100000
    my_data = [[1, 2, 3] for _ in range(n)]
    not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

    # Alternative 1: pandas.DataFrame.to_gbq()
    start = time.time()
    not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                                   Context.default().project_id,
                                   chunksize=10000,
                                   if_exists='append',
                                   verbose=False)
    end = time.time()
    print("time alternative 1 " + str(end - start))

    # Alternative 3: write the DataFrame to GCS, then insert it into BigQuery via Datalab
    start = time.time()
    sample_bucket_name = Context.default().project_id + '-datalab-example'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket_object = sample_bucket_path + '/Hello.txt'
    bigquery_dataset_name = 'TestDataSet'
    bigquery_table_name = 'TestTable'

    # Define storage bucket
    sample_bucket = storage.Bucket(sample_bucket_name)

    # Define the BigQuery table. The original snippet used `table` without defining it;
    # this line and table.create() below are assumed from the usual Datalab sample pattern.
    table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

    # Create or overwrite the existing table if it exists
    table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
    table.create(schema=table_schema, overwrite=True)

    # Write the DataFrame to GCS (Google Cloud Storage).
    # This is a Datalab notebook magic and only runs inside a Datalab notebook.
    %storage write --variable not_so_simple_dataframe --object $sample_bucket_object

    # Write the DataFrame to a BigQuery table
    table.insert_data(not_so_simple_dataframe)
    end = time.time()
    print("time alternative 3 " + str(end - start))
    

    Here are the results for n = 10,000, 100,000, and 1,000,000 rows:

    n       alternative_1  alternative_3
    10000   30.72s         8.14s
    100000  162.43s        70.64s
    1000000 1473.57s       688.59s
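
    To make the gap easier to read, the ratio between the two columns can be computed directly. This is a minimal sketch that only reuses the timings from the table; the names `timings` and `speedups` are illustrative and not part of the original answer.

```python
# Timings in seconds copied from the results table: n -> (alternative_1, alternative_3)
timings = {
    10_000: (30.72, 8.14),
    100_000: (162.43, 70.64),
    1_000_000: (1473.57, 688.59),
}

# Speedup of alternative 3 over alternative 1, rounded to two decimals
speedups = {n: round(alt1 / alt3, 2) for n, (alt1, alt3) in timings.items()}

for n, factor in speedups.items():
    print(f"n={n}: alternative 3 is {factor}x faster")
```

    The ratio is largest for the smallest input and settles a little above 2 for the larger ones.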
    

    Judging from the results, alternative 3 was roughly 2 to 4 times faster than alternative 1 at each size tested.
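
    The start/end timing pattern used in the comparison can be factored into a small helper so that each alternative is measured the same way. This is only a sketch: `time_call` and `fake_upload` are hypothetical names, and `fake_upload` stands in for a real upload such as `to_gbq()` so that the example runs without any Google Cloud credentials.

```python
import time


def time_call(label, func, *args, **kwargs):
    """Run func once, print the elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"time {label}: {elapsed:.2f}s")
    return result, elapsed


def fake_upload(rows):
    """Hypothetical stand-in for an upload; returns the number of rows it received."""
    return len(rows)


rows = [[1, 2, 3] for _ in range(1000)]
result, elapsed = time_call("fake upload", fake_upload, rows)
```

    Swapping `fake_upload` for a function that wraps one of the alternatives would give timings comparable to the ones above, although a single run per size is still a rough measurement.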
