Question
I need to improve the performance of an Airflow task that transfers data from BigQuery to MongoDB. The relevant task in my DAG uses a PythonOperator and simply calls the following Python function to transfer a single table/collection:
import time

from google.cloud import bigquery
from pandas.api.types import is_datetime64_any_dtype as is_datetime  # assumed; the post does not show this import
from pymongo import MongoClient

# MONGO_URI is assumed to be defined elsewhere in the DAG module.

def transfer_full_table(table_name):
    start_time = time.time()

    # (1) Connect to BigQuery + Mongo DB
    bq = bigquery.Client()
    cluster = MongoClient(MONGO_URI)
    db = cluster["dbname"]
    print(f'(1) Connected to BQ + Mongo: {round(time.time() - start_time, 5)}')

    # (2)-(3) Run the BQ Queries
    full_query = f"select * from `gcpprojectid.models.{table_name}`"
    results1 = bq.query(full_query)
    print(f'(2) Queried BigQuery: {round(time.time() - start_time, 5)}')
    results = results1.to_dataframe()
    print(f'(3) Converted to Pandas DF: {round(time.time() - start_time, 5)}')

    # (4) Handle Missing DateTimes # Can we refactor this into its own function?
    datetime_cols = [key for key in dict(results.dtypes) if is_datetime(results[key])]
    for col in datetime_cols:
        # Replace missing datetimes (NaT) with None before inserting
        results[[col]] = results[[col]].astype(object).where(results[[col]].notnull(), None)
    print(f'(4) Resolved Datetime Issue: {round(time.time() - start_time, 5)}')

    # (5) And Insert Properly Into Mongo
    db[table_name].drop()
    db[table_name].insert_many(results.to_dict('records'))
    print(f'(5) Wrote to Mongo: {round(time.time() - start_time, 5)}')
The DAG is set up to transfer many tables from BigQuery to MongoDB (one transfer per task), and this particular transfer_full_table function is meant to transfer a single table in full, so it simply:
- queries the entire BQ table
- converts to pandas and fixes a type issue
- drops the previous MongoDB collection and reinserts
I am attempting to use this function on a table that is 60MB in size, and here is the performance of the various parts of the task:
(1) Connected to BQ + Mongo: 0.0786
(2) Queried BigQuery: 0.80595
(3) Converted to Pandas DF: 87.2797
(4) Resolved Datetime Issue: 88.33461
(5) Wrote to Mongo: 213.92398
Steps 3 and 5 are taking almost all of the time. The task connects to BQ and Mongo very quickly (1), and BQ appears to query this 60 MB table very quickly (2). However, converting to a pandas dataframe (3), which is needed for (4) to handle a type issue I was having, takes ~86.5 seconds. Resolving the datetime issue (4) is then very fast, but dropping the previous MongoDB collection and re-inserting the new pandas dataframe into MongoDB (5) takes (213.9 - 88.3) = ~125 seconds.
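The printed figures are cumulative times since start_time, so each step's own duration has to be worked out by subtraction. As a minimal sketch of recording per-step durations directly (timed_step and timings are hypothetical names, not part of the original function, and time.sleep stands in for a real step):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(label, timings):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with timed_step("(2) Queried BigQuery", timings):
    time.sleep(0.01)  # stand-in for a real step such as bq.query(...)

for label, seconds in timings.items():
    print(f"{label}: {round(seconds, 5)}")
```

Each step of transfer_full_table could be wrapped in its own with block, which avoids the manual subtraction.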
Any tips, either on the Pandas or on the MongoDB end, as to how I can optimize for these two bottlenecks, would be greatly appreciated!
Answer 1:
The short answer is that asynchronous operations are muddying your profiling.
The docs on bq.query state that the resulting google.cloud.bigquery.job.QueryJob object is an asynchronous query job. This means that, after the query is submitted, the Python interpreter does not block until you try to use the results of the query with one of the synchronous QueryJob methods, such as to_dataframe(). A significant share of the 87 seconds you're seeing is likely just spent waiting for the query to return.
You could wait for the query to complete by calling QueryJob.done() repeatedly until it returns True, then call your 2nd profiling print statement.
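The polling idea can be sketched as below. This is an illustration under assumptions, not the library's own mechanism: wait_until_done is a hypothetical helper, and FakeJob is a stand-in so the example runs without BigQuery credentials. The only thing it relies on from the real QueryJob is a done() method that returns a boolean.

```python
import time


def wait_until_done(job, poll_interval=0.5, timeout=300.0):
    """Poll job.done() until it returns True; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while not job.done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        time.sleep(poll_interval)
    return job


class FakeJob:
    """Stand-in for a query job that reports completion on the third poll."""

    def __init__(self, polls_needed=3):
        self.polls = 0
        self.polls_needed = polls_needed

    def done(self):
        self.polls += 1
        return self.polls >= self.polls_needed


job = wait_until_done(FakeJob(), poll_interval=0.01)
print(f"finished after {job.polls} polls")
```

In the question's function this would sit between bq.query(full_query) and the step (2) print, so that step (2) measures the query itself and step (3) measures only the conversion to pandas. Calling QueryJob.result(), which blocks until the job finishes, is an alternative to polling.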
This isn't quite an optimization of your code, but hopefully it helps move things in the right direction. Some tuning of the pandas roundtrip might help, but I think it's likely that most of your time is being spent waiting on reads and writes against your databases, and that writing more efficient queries, or a larger number of smaller ones, is going to be your only option for cutting down the total time.
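On the write side, one way to try "a larger number of smaller" operations is to send the records in fixed-size batches. The sketch below is hedged: iter_batches, insert_in_batches, and the batch size of 1000 are hypothetical choices, and FakeCollection only mimics the insert_many(documents, ordered=...) call of a PyMongo collection so the example runs without a database. Whether this is actually faster than a single insert_many call would need to be measured.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield successive lists of at most batch_size items from records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, records, batch_size=1000):
    """Call collection.insert_many once per batch; return the documents sent."""
    total = 0
    for batch in iter_batches(records, batch_size):
        # ordered=False lets the server continue past a failed document
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total


class FakeCollection:
    """Stand-in for a MongoDB collection that records batch sizes."""

    def __init__(self):
        self.calls = []

    def insert_many(self, documents, ordered=True):
        self.calls.append(len(documents))


fake = FakeCollection()
sent = insert_in_batches(fake, ({"i": i} for i in range(2500)), batch_size=1000)
print(sent, fake.calls)
```

With the question's code, db[table_name] would take the place of the fake collection and results.to_dict('records') would supply the records.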
Source: https://stackoverflow.com/questions/62078173/optimizing-airflow-task-that-transfers-data-from-bigquery-into-mongodb