Read csv from Google Cloud storage to pandas dataframe

时光说笑 2020-11-28 03:00

I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt


        
7 Answers
  • 2020-11-28 03:30

    If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:

    https://cloud.google.com/storage/docs/access-public-data
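
    For public objects you can also pass the HTTPS endpoint described there straight to pandas; a minimal sketch, where the bucket and object names are placeholders:

    import pandas as pd

    # Public GCS objects are served at https://storage.googleapis.com/<bucket>/<object>;
    # 'my-public-bucket' and 'data.csv' are hypothetical names.
    df = pd.read_csv('https://storage.googleapis.com/my-public-bucket/data.csv')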

  • 2020-11-28 03:31

    Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:

    import pandas as pd
    from tensorflow.python.lib.io import file_io

    with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
      df = pd.read_csv(f)
    

    Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:

    Reading wildcard CSV into Pandas

    Here is code that will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a Pandas dataframe:

    import tensorflow as tf
    from tensorflow.python.lib.io import file_io
    import pandas as pd
    
    def read_csv_file(filename):
      with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
        return df
    
    def read_csv_files(filename_pattern):
      filenames = tf.gfile.Glob(filename_pattern)
      dataframes = [read_csv_file(filename) for filename in filenames]
      return pd.concat(dataframes)
    

    Usage:

    import os

    DATADIR = 'gs://my-bucket/some/dir'
    traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
    evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
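
    In newer TensorFlow releases the tf.gfile module has moved; a rough equivalent of the glob call above, assuming TensorFlow 2.x, would be:

    import tensorflow as tf
    import pandas as pd

    # In TensorFlow 2.x the file utilities live under tf.io.gfile;
    # the pattern below is the same placeholder directory used above.
    filenames = tf.io.gfile.glob('gs://my-bucket/some/dir/train-*')
    with tf.io.gfile.GFile(filenames[0], 'r') as f:
      df = pd.read_csv(f)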
    
  • 2020-11-28 03:36

    Note that you will still need the gcsfs library installed when loading compressed files.

    I tried pd.read_csv('gs://your-bucket/path/data.csv.gz') with pandas version 0.25.3 and got the following error:

    /opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
        438     # See https://github.com/python/mypy/issues/1297
        439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
    --> 440         filepath_or_buffer, encoding, compression
        441     )
        442     kwds["compression"] = compression
    
    /opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
        211 
        212     if is_gcs_url(filepath_or_buffer):
    --> 213         from pandas.io import gcs
        214 
        215         return gcs.get_filepath_or_buffer(
    
    /opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
          3 
          4 gcsfs = import_optional_dependency(
    ----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
          6 )
          7 
    
    /opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
         91     except ImportError:
         92         if raise_on_missing:
    ---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
         94         else:
         95             return None
    
    ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
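
    Installing the optional dependency resolves this (pip install gcsfs), and the gzipped file can then be read directly; a minimal sketch using the same placeholder path:

    import pandas as pd

    # pandas infers gzip compression from the .gz extension; gcsfs must be installed,
    # and 'your-bucket/path/data.csv.gz' is a placeholder path.
    df = pd.read_csv('gs://your-bucket/path/data.csv.gz')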
    
  • 2020-11-28 03:40

    UPDATE

    As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the file in the bucket like this:

    df = pd.read_csv('gs://bucket/your_path.csv')
    

    I leave three other options for the sake of completeness.

    • Home-made code
    • gcsfs
    • dask

    I will cover them below.

    The hard way: do-it-yourself code

    I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

    It works equally well on public and private data sets, assuming you are authorised. In this approach you don't need to download the data to your local drive first.

    How to use it:

    fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(fileobj)
    

    The code:

    from io import BytesIO, StringIO
    from google.cloud import storage
    from google.oauth2 import service_account
    
    def get_byte_fileobj(project: str,
                         bucket: str,
                         path: str,
                         service_account_credentials_path: str = None) -> BytesIO:
        """
        Retrieve data from a given blob on Google Storage and pass it as a file object.
        :param path: path within the bucket
        :param project: name of the project
        :param bucket: name of the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: file object (BytesIO)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        byte_stream = BytesIO()
        blob.download_to_file(byte_stream)
        byte_stream.seek(0)
        return byte_stream
    
    def get_bytestring(project: str,
                       bucket: str,
                       path: str,
                       service_account_credentials_path: str = None) -> bytes:
        """
        Retrieve data from a given blob on Google Storage and pass it as a byte-string.
        :param path: path within the bucket
        :param project: name of the project
        :param bucket: name of the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: byte-string (needs to be decoded)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        s = blob.download_as_string()
        return s
    
    
    def _get_blob(bucket_name, path, project, service_account_credentials_path):
        credentials = service_account.Credentials.from_service_account_file(
            service_account_credentials_path) if service_account_credentials_path else None
        storage_client = storage.Client(project=project, credentials=credentials)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(path)
        return blob
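
    The byte-string helper can be wired into pandas as well; a small sketch, assuming the helpers above and the same placeholder project/bucket/path names:

    from io import StringIO
    import pandas as pd

    # get_bytestring returns raw bytes, so decode before handing the text to pandas.
    csv_bytes = get_bytestring('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(StringIO(csv_bytes.decode('utf-8')))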
    

    gcsfs

    gcsfs is a "Pythonic file-system for Google Cloud Storage".

    How to use it:

    import pandas as pd
    import gcsfs
    
    fs = gcsfs.GCSFileSystem(project='my-project')
    with fs.open('bucket/path.csv') as f:
        df = pd.read_csv(f)
    

    dask

    Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

    Dask provides its own read_csv, which accepts gs:// paths and wildcards.

    How to use it:

    import dask.dataframe as dd
    
    df = dd.read_csv('gs://bucket/data.csv')
    df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
    
    # df is now a Dask dataframe, ready for distributed processing
    # If you want the pandas version, simply:
    df_pd = df.compute()
    
  • 2020-11-28 03:43

    read_csv does not support gs://

    From the documentation:

    The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

    You can download the file or fetch it as a string in order to manipulate it.
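
    One way to do that is with the google-cloud-storage client library; a minimal sketch, where 'my-bucket' and 'my.csv' are placeholder names:

    from io import StringIO
    import pandas as pd
    from google.cloud import storage

    # Download the object's contents as bytes, decode, and hand the text to pandas.
    client = storage.Client()
    blob = client.bucket('my-bucket').blob('my.csv')
    df = pd.read_csv(StringIO(blob.download_as_string().decode('utf-8')))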

  • 2020-11-28 03:48

    As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.

    Until the official release you can try it out with pip install pandas==0.24.0rc1.
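
    Once gcsfs and that pandas version are installed, a gs:// path can be passed to read_csv directly, as shown in the earlier answers; e.g., with a placeholder path:

    import pandas as pd

    # Requires the gcsfs package; 'bucket/your_path.csv' is a placeholder.
    df = pd.read_csv('gs://bucket/your_path.csv')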
