Loading multiple files from Google Cloud Storage into a single Pandas Dataframe

混江龙づ霸主 submitted on 2019-12-07 12:08:26

Question


I have been trying to write a function that loads multiple files from a Google Cloud Storage bucket into a single Pandas DataFrame, but I cannot get it to work.

import pandas as pd
from google.datalab import storage
from io import BytesIO


def gcs_loader(bucket_name, prefix):
  bucket = storage.Bucket(bucket_name)
  df = pd.DataFrame()
  for shard in bucket.objects(prefix=prefix):
    fp = shard.uri
    %gcs read -o $fp -v tmp
    df.append(read_csv(BytesIO(tmp))
  return df

When I try to run it, I get this error:

undefined variable referenced in command line: $fp


Answer 1:


Sure, here's an example: https://colab.research.google.com/notebook#fileId=0B7I8C_4vGdF6Ynl1X25iTHE4MGc

The notebook does the following:

  1. Creates two random CSV files.
  2. Uploads both CSV files to a GCS bucket.
  3. Uses the GCS Python API to iterate over the files in the bucket.
  4. Merges the files into a single Pandas DataFrame.
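
For readers who cannot open the notebook, steps 3 and 4 can be sketched roughly as follows. This is not the notebook's code. It is a minimal sketch that assumes the google-cloud-storage client library (instead of google.datalab) with default credentials configured, and a version recent enough to provide download_as_bytes. The helper name concat_csv_payloads is hypothetical. The sketch avoids the %gcs magic entirely, since the error in the question suggests the magic cannot see the function-local variable fp. It also collects the frames and calls pd.concat once, because DataFrame.append returned a new frame rather than modifying df in place, and it was removed in pandas 2.0.

```python
from io import BytesIO

import pandas as pd


def concat_csv_payloads(payloads):
    """Parse each CSV payload (bytes) and stack the rows into one DataFrame.

    Hypothetical helper, kept separate so it can be tried without GCS access.
    """
    frames = [pd.read_csv(BytesIO(data)) for data in payloads]
    if not frames:
        return pd.DataFrame()
    # One concat at the end is cheaper than growing a DataFrame inside the loop.
    return pd.concat(frames, ignore_index=True)


def gcs_loader(bucket_name, prefix):
    """Load every object under `prefix` in `bucket_name` into one DataFrame."""
    # Imported here so the helper above works without the GCS client installed.
    # Assumption: the google-cloud-storage package, not google.datalab.
    from google.cloud import storage

    client = storage.Client()  # assumes default credentials are configured
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return concat_csv_payloads(blob.download_as_bytes() for blob in blobs)
```

Calling gcs_loader("my-bucket", "shards/") (both names are placeholders) would then return a single DataFrame, provided every object under the prefix is a CSV file with the same columns.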


Source: https://stackoverflow.com/questions/46885631/loading-multiple-files-from-google-cloud-storage-into-a-single-pandas-dataframe
