How to load data into a Jupyter Notebook VM from Google Cloud?

纵然是瞬间 submitted on 2020-08-08 18:46:06

Question


I am trying to load a bunch of CSV files stored on Google Cloud into my Jupyter notebook. I use Python 3, and gsutil does not work.

Let's assume I have 6 .csv files in '\bucket1\1'. Does anybody know what I should do?


Answer 1:


You are running a Jupyter Notebook on a Google Cloud VM instance, and you want to load 6 .csv files (currently stored in Cloud Storage) into it.

Install the dependencies:

pip install google-cloud-storage
pip install pandas

Run the following script in your notebook:

from google.cloud import storage
import pandas as pd

bucket_name = "my-bucket-name"

# By default the client uses the credentials available in the environment.
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# Option 1: the files are in a subfolder of the bucket.
my_prefix = "csv/"  # the name of the subfolder, including the trailing slash
blobs = bucket.list_blobs(prefix=my_prefix, delimiter="/")

for blob in blobs:
    if blob.name != my_prefix:  # skip the placeholder object for the subfolder itself
        file_name = blob.name[len(my_prefix):]  # strip only the leading prefix
        blob.download_to_filename(file_name)  # download the file to the machine
        df = pd.read_csv(file_name)  # load the data
        print(df)

# Option 2: the files are on the first level of the bucket.
# delimiter="/" limits the listing to top-level objects, so object names
# never contain a folder path that would not exist locally.
blobs = bucket.list_blobs(delimiter="/")

for blob in blobs:
    file_name = blob.name
    blob.download_to_filename(file_name)  # download the file to the machine
    df = pd.read_csv(file_name)  # load the data
    print(df)
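
If you do not need local copies of the files, a possible variation is to read each object straight into memory. The sketch below is not part of the original answer: the helper name blob_to_dataframe is made up for illustration, and it assumes a reasonably recent google-cloud-storage release in which Blob.download_as_bytes() is available (older releases offered download_as_string() instead).

```python
import io

import pandas as pd


def blob_to_dataframe(blob):
    """Read one CSV object into a DataFrame without writing a local file.

    `blob` is assumed to be a google.cloud.storage Blob, or any object that
    offers a compatible download_as_bytes() method.
    """
    data = blob.download_as_bytes()  # the whole object is held in memory
    return pd.read_csv(io.BytesIO(data))


# Hypothetical usage inside either loop of the script above:
# for blob in blobs:
#     df = blob_to_dataframe(blob)
#     print(df)
```

This keeps the VM's disk clean, at the cost of holding each file fully in memory while it is parsed, so it is probably best suited to small or medium-sized files.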

Notes:

  • pandas is a widely used Python library for data analysis, so it makes working with the CSV files easier.

  • The code contains 2 alternatives: one for objects inside a subfolder and one for objects on the first level of the bucket. Use the one that applies to your case.

  • The code loops over every object it lists, so read_csv may raise errors if the location also contains objects that are not CSV files.

  • If the files are already on the machine where you are running the notebook, you can ignore the Google Cloud Storage lines and just pass the absolute or relative path of each file to the "read_csv" method.

  • For more information, see the Cloud Storage documentation on listing objects and on downloading objects.
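
To guard against the errors mentioned in the notes, one option is to decide from the object name alone whether an object should be downloaded. The following is a minimal sketch, assuming the CSV files can be recognised by their .csv extension; the function name local_csv_name is hypothetical and not part of any library.

```python
def local_csv_name(object_name, prefix=""):
    """Return the local file name for a CSV object, or None to skip it.

    Skips the placeholder object for the subfolder itself, objects outside
    `prefix`, objects nested deeper than `prefix`, and anything whose name
    does not end in .csv (case-insensitive).
    """
    if not object_name.startswith(prefix):
        return None
    relative = object_name[len(prefix):]
    if not relative or "/" in relative:
        return None
    if not relative.lower().endswith(".csv"):
        return None
    return relative


# Hypothetical usage with the loop from the answer:
# for blob in blobs:
#     file_name = local_csv_name(blob.name, my_prefix)
#     if file_name is not None:
#         blob.download_to_filename(file_name)
#         df = pd.read_csv(file_name)
```

For objects on the first level of the bucket, the same helper can be called without a prefix.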



Source: https://stackoverflow.com/questions/56721927/how-to-load-data-to-jupyter-notebook-vm-from-google-cloud
