Question
Here is what I tried (IPython notebook, Python 2.7):
import gcp
import gcp.storage as storage
import gcp.bigquery as bq
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
sample_bucket_name = gcp.Context.default().project_id + '-datalab'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/myFile.csv'
sample_bucket = storage.Bucket(sample_bucket_name)
df = bq.Query(sample_bucket_object).to_dataframe()
This fails.
Do you have any leads on what I am doing wrong?
Answer 1:
In addition to @Flair's comments about %gcs, I got the following to work for the Python 3 kernel:
import pandas as pd
from io import BytesIO
%gcs read --object "gs://[BUCKET ID]/[FILE].csv" --variable csv_as_bytes
df = pd.read_csv(BytesIO(csv_as_bytes))
df.head()
Answer 2:
Based on the datalab source code, bq.Query() is primarily used to execute BigQuery SQL queries, not to read files. To read a file from Google Cloud Storage (GCS), one option is to use the datalab %gcs line magic to read the CSV from GCS into a local variable. Once the data is in a variable, you can use pd.read_csv() to convert the CSV-formatted data into a pandas DataFrame. The following should work:
import pandas as pd
from StringIO import StringIO  # Python 2 module, matching the question's Python 2.7 kernel
# Read csv file from GCS into a variable
%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars
# Store in a pandas dataframe
df = pd.read_csv(StringIO(cars))
There is also a related Stack Overflow question: Reading in a file with Google datalab
Answer 3:
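The snippet above targets Python 2, where the variable filled by %gcs can be wrapped in StringIO. Answer 1 suggests that on the Python 3 kernel the same variable holds bytes. A minimal sketch of a helper that accepts either form is shown below; the function name csv_payload_to_dataframe and the sample data are hypothetical, and the payload is hard-coded as a stand-in for whatever %gcs read stores in the variable.

```python
from io import BytesIO, StringIO

import pandas as pd


def csv_payload_to_dataframe(payload, encoding="utf-8"):
    """Parse an in-memory CSV payload (bytes or str) into a pandas DataFrame."""
    if isinstance(payload, bytes):
        # Bytes need a binary buffer; pandas decodes them with `encoding`.
        return pd.read_csv(BytesIO(payload), encoding=encoding)
    # Text can be wrapped in a text buffer directly.
    return pd.read_csv(StringIO(payload))


# Hypothetical stand-in for the variable that `%gcs read ... --variable cars` fills.
sample = b"make,mpg\nfiat,32.4\nhonda,30.4\n"
df = csv_payload_to_dataframe(sample)
```

In a notebook, the hard-coded sample would be replaced by the variable named in the --variable option.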
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure Dask is installed; reading gs:// paths also requires the gcsfs package.
conda install dask  # conda
pip install "dask[complete]"  # pip
import dask.dataframe as dd  # Import Dask's DataFrame API
dataframe = dd.read_csv('gs://bucket/datafile.csv')  # Read a single CSV file
dataframe2 = dd.read_csv('gs://bucket/path/*.csv')  # Read every CSV file matching the glob pattern
This is all you need to load the data.
You can now filter and manipulate the data with pandas-style syntax.
dataframe['z'] = dataframe.x + dataframe.y  # Lazy: nothing is computed yet
dataframe_pd = dataframe.compute()  # Run the computation and return a pandas DataFrame
Source: https://stackoverflow.com/questions/37990467/how-can-i-load-my-csv-from-google-datalab-to-a-pandas-data-frame
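For data that fits in memory, the glob-style read and the column arithmetic above can be approximated in plain pandas. The sketch below is not Dask and does not touch GCS: it writes two small local files (hypothetical names and values) as stand-ins for the bucket objects, then concatenates them eagerly, which is roughly the result that compute() hands back.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small CSV files to a temporary directory; these stand in for
# the objects that a pattern like 'gs://bucket/path/*.csv' would match.
tmpdir = tempfile.mkdtemp()
parts = [("part1.csv", "x,y\n1,2\n3,4\n"), ("part2.csv", "x,y\n5,6\n")]
for name, rows in parts:
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write(rows)

# Read every matching file and concatenate the results into one DataFrame.
paths = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
frame = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# Same column arithmetic as in the Dask snippet, evaluated immediately.
frame["z"] = frame.x + frame.y
```

The difference is that Dask defers the work until compute() is called and can process the files in parallel, while this version loads everything at once.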