Reading CSV files from Google Cloud Storage using pandas

一个人想着一个人 submitted on 2020-05-23 11:40:08

Question


I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes, as explained in "Read csv from Google Cloud storage to pandas dataframe":

storage_client = storage.Client()

bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)

df = pd.concat(list_temp_raw)

It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' are already installed on my machine; however, I cannot get rid of the following error:

File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line 121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'

Answer 1:


It seems there is a conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install the following libraries (these were the latest versions at the time I posted this):

google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1

Also, the filename already contains the .csv extension, so appending '.csv' again produces a path that does not exist. Change the pd.read_csv call (the 9th line of your snippet) to this:

temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
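
Put back into the loop, the corrected line might look like the sketch below. It is a minimal sketch rather than a tested setup: the helper names build_gcs_uri and read_csv_blobs are hypothetical, the .csv filter is an added assumption (to skip folder placeholders and other objects under the prefix), and it assumes google-cloud-storage, gcsfs and pandas are installed and credentials are configured.

```python
def build_gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI for a blob (hypothetical helper).

    The blob name returned by list_blobs already includes the file
    extension, so nothing is appended to it.
    """
    return "gs://" + bucket_name + "/" + blob_name


def read_csv_blobs(bucket_name, prefix):
    """Read every .csv blob under a prefix into a single DataFrame."""
    # Imported here so build_gcs_uri can be used without these packages.
    # pandas needs gcsfs installed to open gs:// paths.
    import pandas as pd
    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)

    list_temp_raw = []
    for blob in blobs:
        # Assumption: only objects ending in .csv should be read.
        if not blob.name.endswith(".csv"):
            continue
        uri = build_gcs_uri(bucket_name, blob.name)
        # The append stays inside the loop so every file is kept.
        list_temp_raw.append(pd.read_csv(uri, encoding="utf-8"))

    # pd.concat raises ValueError if no CSV files were found.
    return pd.concat(list_temp_raw)
```

Calling read_csv_blobs(bucket_name, prefix) with your own bucket name and prefix would then return the concatenated dataframe.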

With these changes I ran your code and it works. I suggest you create a virtual environment, install the libraries there, and run the code in it.




Answer 2:


This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import dask and gcsfs yourself and check whether dask.bytes.core._filesystems exists and what it contains:

In [1]: import dask.bytes.core

In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}

In [3]: import gcsfs

In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
 'gcs': gcsfs.dask_link.DaskGCSFileSystem,
 'gs': gcsfs.dask_link.DaskGCSFileSystem}

As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better if it is unable to register itself with Dask, so updating gcsfs may solve your problem.
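
The interactive check above can also be scripted, which makes it easy to see at a glance which versions are installed before and after updating. This is only a sketch using the standard library (Python 3.8+ for importlib.metadata); the helper names installed_version and module_has_attr are hypothetical.

```python
import importlib
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def module_has_attr(module_name, attr):
    """Return True or False if the module imports, None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


if __name__ == "__main__":
    for package in ("dask", "gcsfs", "pandas"):
        print(package, installed_version(package))
    # The attribute that gcsfs tried to write to in the traceback.
    print("dask.bytes.core._filesystems present:",
          module_has_attr("dask.bytes.core", "_filesystems"))
```

If the last line reports False, the installed dask no longer exposes _filesystems, which matches the AttributeError in the question.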



Source: https://stackoverflow.com/questions/54988092/reading-csv-files-from-google-cloud-storage-using-pandas
