How can I load my CSV from Google Datalab into a pandas DataFrame?

半腔热情 submitted on 2019-12-17 20:35:58

Question


Here is what I tried (IPython notebook, Python 2.7):

import gcp
import gcp.storage as storage
import gcp.bigquery as bq
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

sample_bucket_name = gcp.Context.default().project_id + '-datalab'
sample_bucket_path = 'gs://' + sample_bucket_name 
sample_bucket_object = sample_bucket_path + '/myFile.csv'
sample_bucket = storage.Bucket(sample_bucket_name)
df = bq.Query(sample_bucket_object).to_dataframe()

This fails.
Do you have any leads on what I am doing wrong?


Answer 1:


In addition to @Flair's comments about %gcs, I got the following to work for the Python 3 kernel:

    import pandas as pd
    from io import BytesIO

    %gcs read --object "gs://[BUCKET ID]/[FILE].csv" --variable csv_as_bytes

    df = pd.read_csv(BytesIO(csv_as_bytes))
    df.head()



Answer 2:


Based on the datalab source code, bq.Query() is primarily used to execute BigQuery SQL queries, so passing it the path of a GCS object does not read the file. To read a file from Google Cloud Storage (GCS), one potential solution is to use the datalab %gcs line magic to read the CSV from GCS into a local variable. Once the data is in a variable, you can use pd.read_csv() to convert the CSV-formatted data into a pandas DataFrame. The following should work:

import pandas as pd
from StringIO import StringIO

# Read csv file from GCS into a variable
%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars

# Store in a pandas dataframe
df = pd.read_csv(StringIO(cars))

There is also a related Stack Overflow question: Reading in a file with Google datalab
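
Both answers above wrap the variable filled by %gcs read in a file-like object before handing it to pd.read_csv(): BytesIO on the Python 3 kernel, StringIO on Python 2. A minimal Python 3 sketch of a helper that accepts either bytes or text follows. The helper name csv_to_dataframe and the sample data are hypothetical, and the in-memory sample only stands in for the variable that %gcs read would populate.

```python
import io

import pandas as pd


def csv_to_dataframe(raw, **read_csv_kwargs):
    """Parse CSV content held in memory (bytes or str) into a DataFrame."""
    if isinstance(raw, bytes):
        buffer = io.BytesIO(raw)   # binary content, as in the Python 3 answer
    else:
        buffer = io.StringIO(raw)  # content that is already decoded text
    return pd.read_csv(buffer, **read_csv_kwargs)


# Hypothetical stand-in for the variable that `%gcs read ... --variable` fills.
sample = b"make,price\naudi,13950\nbmw,16430\n"

df = csv_to_dataframe(sample)
print(df.head())
```

Extra keyword arguments such as sep or encoding are passed straight through to pd.read_csv().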




Answer 3:


You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.

Make sure Dask is installed.

conda install dask #conda

pip install dask[complete] #pip

import dask.dataframe as dd #Import

dataframe = dd.read_csv('gs://bucket/datafile.csv') #Read CSV data

dataframe2 = dd.read_csv('gs://bucket/path/*.csv') #Read all CSV files matching the pattern

This is all you need to load the data.

You can now filter and manipulate the data with pandas syntax.

dataframe['z'] = dataframe.x + dataframe.y

dataframe_pd = dataframe.compute() #Materialize the result as a pandas DataFrame
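
To illustrate what the wildcard read and the column arithmetic in the Dask snippet above amount to, here is a rough plain-pandas equivalent that runs on local temporary files rather than on a gs:// bucket. It is only a sketch: the file names and values are made up, and Dask reads lazily and in parallel, which this version does not.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files locally (hypothetical data standing in
# for the objects matched by 'gs://bucket/path/*.csv').
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"x": [1, 2], "y": [10, 20]}).to_csv(
    os.path.join(tmp_dir, "part-0.csv"), index=False
)
pd.DataFrame({"x": [3], "y": [30]}).to_csv(
    os.path.join(tmp_dir, "part-1.csv"), index=False
)

# Read every file matching the pattern and stack them into one DataFrame.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# Same column arithmetic as in the Dask snippet.
combined["z"] = combined.x + combined.y
print(combined)
```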



Source: https://stackoverflow.com/questions/37990467/how-can-i-load-my-csv-from-google-datalab-to-a-pandas-data-frame
