I have been trying to write a function that loads multiple files from a Google Cloud Storage bucket into a single Pandas DataFrame, but I cannot seem to make it work.
```python
import pandas as pd
from google.datalab import storage
from io import BytesIO

def gcs_loader(bucket_name, prefix):
    bucket = storage.Bucket(bucket_name)
    df = pd.DataFrame()
    for shard in bucket.objects(prefix=prefix):
        fp = shard.uri
        %gcs read -o $fp -v tmp
        df.append(pd.read_csv(BytesIO(tmp)))
    return df
```
When I try to run it, it says:

```
undefined variable referenced in command line: $fp
```
Sure, here's an example: https://colab.research.google.com/notebook#fileId=0B7I8C_4vGdF6Ynl1X25iTHE4MGc
This notebook shows how to:

- Create two random CSVs
- Upload both CSV files to a GCS bucket
- Use the GCS Python API to iterate over the files in the bucket
- Merge each file into a single Pandas DataFrame
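The steps above can be sketched in plain Python without the `%gcs` cell magic (which only resolves `$`-variables in a notebook cell, not inside a function body). This is a minimal sketch, assuming the `google-cloud-storage` client library is installed and credentials are configured; the `gcs_loader` name mirrors the question's function, and the bucket/prefix values are whatever you pass in:

```python
import pandas as pd
from io import BytesIO


def frames_to_df(csv_buffers):
    # Parse each CSV buffer and stack the results into one DataFrame,
    # renumbering the index so rows from different files don't collide.
    return pd.concat((pd.read_csv(buf) for buf in csv_buffers),
                     ignore_index=True)


def gcs_loader(bucket_name, prefix):
    # Download every object under `prefix` and merge them into a
    # single DataFrame. Requires the google-cloud-storage package.
    from google.cloud import storage
    client = storage.Client()
    buffers = (BytesIO(blob.download_as_bytes())
               for blob in client.list_blobs(bucket_name, prefix=prefix))
    return frames_to_df(buffers)
```

Note that `pd.concat` is used instead of `df.append`: `append` returns a new DataFrame rather than modifying `df` in place, which is another reason the original loop produced an empty result.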
Source: https://stackoverflow.com/questions/46885631/loading-multiple-files-from-google-cloud-storage-into-a-single-pandas-dataframe