Question
TL;DR: asyncio vs. multiprocessing vs. threading vs. some other solution to parallelize a for loop that reads files from GCS, appends that data together into a pandas dataframe, then writes to BigQuery...
I'd like to parallelize a Python function that reads hundreds of thousands of small .json files from a GCS directory, converts each one into a pandas dataframe, and then writes those dataframes to a BigQuery table.
Here is a non-parallel version of the function:
import json
import gcsfs
import pandas as pd
from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):
    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new dataframe to accumulate rows
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem()  # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:
        # read each file from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        this_df = pd.DataFrame(data)
        output_df = output_df.append(this_df)

        # Write to BigQuery every 5,000 files (my_id is my GCP project id)
        counter += 1
        if counter % 5000 == 0:
            output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame()  # and reset the dataframe

    # Write remaining rows to BigQuery
    output_df.to_gbq(bq_table, project_id=my_id, if_exists='append')
This function is straightforward:
- grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
- loop over each file name, and:
  - read the file from GCS
  - convert the data into a pandas DF
  - append it to a main pandas DF
  - every 5K loops, write to BigQuery (since the appends get much slower as the DF gets larger)
I have to run this function on a few GCS directories each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process will take ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.
Edit: The solutions below are helpful, but I am particularly interested in parallelizing from within the Python script. Pandas handles some data cleaning, so using bq load directly will throw errors. asyncio and gcloud-aio-storage both seem potentially useful for this task, maybe as better options than threading or multiprocessing...
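For context, the kind of in-script parallelism I have in mind would look roughly like the thread-pool sketch below (the work is I/O-bound, so threads seem like the simplest option). read_one, MAX_WORKERS, and BATCH_SIZE are placeholder names of my own, and it assumes the same gcsfs / to_gbq setup as the function above:

import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs
import pandas as pd

MAX_WORKERS = 32   # tune to available bandwidth / GCS quotas
BATCH_SIZE = 5000  # same flush threshold as the serial version

fs = gcsfs.GCSFileSystem()

def read_one(path):
    # read a single JSON file from GCS and return it as a small dataframe
    with fs.open(path, 'r') as f:
        data = json.loads(f.read())
    return pd.DataFrame([data] if isinstance(data, dict) else data)

def load_gcs_to_bq_threaded(files, bq_table, project_id):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for start in range(0, len(files), BATCH_SIZE):
            batch = files[start:start + BATCH_SIZE]
            # the reads run concurrently; results come back in order
            dfs = list(pool.map(read_one, batch))
            pd.concat(dfs, ignore_index=True).to_gbq(
                bq_table, project_id=project_id, if_exists='append')

I'm not sure whether this, multiprocessing, or asyncio with gcloud-aio-storage would actually be fastest here, which is the crux of the question.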
Answer 1:
Rather than add parallel processing to your python code, consider invoking your python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:
Your line:
# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #
New line:
files = sys.argv[1:] # ok, import sys, too
Now, you can invoke your program this way:
PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program
xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each command line. I believe the number of file names per invocation is limited to the maximum command size allowed by the shell. If 100 processes are not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
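For illustration, the worker that xargs invokes could be the original function with the listing step swapped for sys.argv; a rough sketch follows (your_program.py, the destination table, and the project id are placeholders, not the asker's real values):

# your_program.py - processes only the file names passed on its command line
import json
import sys

import gcsfs
import pandas as pd

def main():
    files = sys.argv[1:]  # the file names xargs put on this command line
    fs = gcsfs.GCSFileSystem()
    dfs = []
    for file in files:
        with fs.open(file, 'r') as f:
            data = json.loads(f.read())
        dfs.append(pd.DataFrame([data] if isinstance(data, dict) else data))
    if dfs:
        pd.concat(dfs, ignore_index=True).to_gbq(
            'my_dataset.my_table',       # placeholder destination table
            project_id='my-project-id',  # placeholder GCP project id
            if_exists='append')

if __name__ == '__main__':
    main()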
Answer 2:
Instead of doing this, you can use the bq command directly.
The bq command-line tool is a Python-based command-line tool for BigQuery.
When you use this command, the loading takes place inside Google's network, which is much faster than building a dataframe locally and loading it into the table.
bq load \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
gs://mybucket/my_json_folder/*.json
For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
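If you need to kick the load off from within Python rather than the shell, the same job can be started with the google-cloud-bigquery client. This is only a rough equivalent of the command above, reusing its bucket path and table name and assuming default credentials:

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/my_json_folder/*.json",  # wildcard over the JSON files
    "mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # block until the load job completes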
Source: https://stackoverflow.com/questions/63045305/python-read-json-files-from-gcs-into-pandas-df-in-parallel