Optimizing Airflow Task that transfers data from BigQuery into MongoDB

Question

I need to improve the performance of an Airflow task that transfers data from BigQuery to MongoDB. The relevant task in my DAG uses a PythonOperator and simply calls the following Python function to transfer a single table/collection:

import time

from google.cloud import bigquery
from pymongo import MongoClient
# Assumption: is_datetime is this pandas helper; the original post does not show its import.
from pandas.api.types import is_datetime64_any_dtype as is_datetime

# MONGO_URI is assumed to be defined elsewhere (not shown in the original post).

def transfer_full_table(table_name):
    start_time = time.time()

    # (1) Connect to BigQuery + Mongo DB
    bq = bigquery.Client()
    cluster = MongoClient(MONGO_URI)
    db = cluster["dbname"]
    print(f'(1) Connected to BQ + Mongo: {round(time.time() - start_time, 5)}')

    # (2)-(3) Run the BQ Queries
    full_query = f"select * from `gcpprojectid.models.{table_name}`"
    results1 = bq.query(full_query)
    print(f'(2) Queried BigQuery: {round(time.time() - start_time, 5)}')
    results = results1.to_dataframe()
    print(f'(3) Converted to Pandas DF: {round(time.time() - start_time, 5)}')

    # (4) Handle Missing DateTimes   # Can we refactor this into its own function?
    # Cast datetime columns to object so that missing values become None instead of NaT.
    datetime_cols = [key for key in dict(results.dtypes) if is_datetime(results[key])]
    for col in datetime_cols:
        results[[col]] = results[[col]].astype(object).where(results[[col]].notnull(), None)
    print(f'(4) Resolved Datetime Issue: {round(time.time() - start_time, 5)}')

    # (5) And Insert Properly Into Mongo
    db[table_name].drop()
    db[table_name].insert_many(results.to_dict('records'))
    print(f'(5) Wrote to Mongo: {round(time.time() - start_time, 5)}')

The DAG is set up to transfer many tables from BigQuery to MongoDB (one transfer per task), and this particular transfer_full_table function is meant to transfer a single table in its entirety, so it simply:

queries the entire BQ table
converts the result to pandas and fixes a type issue
drops the previous MongoDB collection and reinserts the data

I am attempting to use this function on a table that is 60MB in size, and here is the performance of the various parts of the task (cumulative seconds since the start of the function):

(1) Connected to BQ + Mongo: 0.0786
(2) Queried BigQuery: 0.80595
(3) Converted to Pandas DF: 87.2797
(4) Resolved Datetime Issue: 88.33461
(5) Wrote to Mongo: 213.92398

Steps 3 and 5 take nearly all of the time. The task connects to BQ and Mongo very quickly (1), and BQ queries this 60MB table very quickly (2). However, converting to a pandas dataframe (3), which is needed for (4) to handle a type issue I was having, takes ~86.5 seconds. Resolving the datetime issue (4) is then very fast, but dropping the previous MongoDB collection and re-inserting the new pandas dataframe into MongoDB (5) takes (213.9 - 88.3) = ~125 seconds.
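Because the printed values are cumulative times since start_time, per-step durations have to be worked out by subtraction, as above. A minimal sketch of recording each step's own duration instead, using only the standard library, might look like the following. The timed helper and the timings dictionary are hypothetical names introduced for illustration; they are not part of the original function.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the duration of this step only, not the time since the function began.
        timings[label] = time.perf_counter() - start


timings = {}

with timed("(3) convert to DataFrame", timings):
    time.sleep(0.01)  # stand-in for results1.to_dataframe()

with timed("(5) write to Mongo", timings):
    time.sleep(0.01)  # stand-in for the drop + insert_many calls

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Wrapping each numbered step of transfer_full_table in its own with block would give the step durations directly, without the manual subtraction.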

Any tips, either on the Pandas or on the MongoDB end, as to how I can optimize for these two bottlenecks, would be greatly appreciated!

Answer 1:

The short answer is that asynchronous operations are muddying your profiling.

The docs on bq.query state that the resulting google.cloud.bigquery.job.QueryJob object is an asynchronous query job. This means that, after the query is submitted, the Python interpreter does not block until you try to use the results of the query through one of the synchronous QueryJob methods, such as to_dataframe(). A significant share of the 87 seconds you're seeing is likely just spent waiting for the query to return.

You could wait for the query to complete by calling QueryJob.done() repeatedly until it returns True, and only then run your second profiling print statement.
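A minimal sketch of that polling loop is shown here. The wait_until_done helper, its parameters, and the FakeJob class are hypothetical names used for illustration; FakeJob only stands in for a QueryJob so the example runs without BigQuery credentials. The sketch relies on nothing more than the job object exposing a done() method.

```python
import time


def wait_until_done(job, poll_interval=0.5, timeout=300.0):
    """Poll job.done() until it returns True and return the seconds spent waiting.

    Raises TimeoutError if the job has not finished within `timeout` seconds.
    """
    start = time.monotonic()
    while not job.done():
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        time.sleep(poll_interval)
    return time.monotonic() - start


class FakeJob:
    """Hypothetical stand-in for a QueryJob: done() returns True on the n-th call."""

    def __init__(self, polls_needed):
        self._remaining = polls_needed
        self.calls = 0

    def done(self):
        self.calls += 1
        self._remaining -= 1
        return self._remaining <= 0


job = FakeJob(polls_needed=3)
waited = wait_until_done(job, poll_interval=0.001)
print(f"job finished after {job.calls} polls, waited {waited:.4f}s")
```

In the question's function, this would mean calling wait_until_done(results1) just before the step (2) print, so that query time and DataFrame conversion time are reported separately. Calling results1.result(), which blocks until the job completes, should serve the same purpose without a hand-written loop.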

This isn't quite an optimization of your code, but hopefully it helps you move in the right direction. It's possible that some tuning of the pandas round trip could help, but I think it's likely that most of your time is being spent waiting on reads from and writes to your databases, and that writing more efficient queries, or a larger number of smaller ones, is going to be your only option for cutting down the total time.
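As one way to picture "a larger number of smaller" operations on the MongoDB side, the records could be written in fixed-size batches rather than in a single insert_many call. The sketch below is an assumption-laden illustration, not a measured improvement: batched, insert_in_batches, the batch size of 1000, and FakeCollection are all hypothetical, and FakeCollection merely mimics the insert_many(documents, ordered=...) call of a pymongo collection so the example runs without a database.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, records, batch_size=1000):
    """Call collection.insert_many once per batch and return the number of records sent."""
    total = 0
    for batch in batched(records, batch_size):
        # ordered=False lets the server continue past individual failed documents.
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total


class FakeCollection:
    """Hypothetical stand-in for a pymongo collection; it only records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, documents, ordered=True):
        self.batch_sizes.append(len(documents))


collection = FakeCollection()
records = ({"row": i} for i in range(2500))
inserted = insert_in_batches(collection, records, batch_size=1000)
print(inserted, collection.batch_sizes)
```

Whether batching like this actually reduces the ~125 seconds of step (5) would need to be measured against the real collection; timing each batch separately would at least show whether the cost grows with the data volume or is dominated by something else, such as network latency.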

Source: https://stackoverflow.com/questions/62078173/optimizing-airflow-task-that-transfers-data-from-bigquery-into-mongodb

Tags

python

pandas

mongodb

airflow