How best to convert an Azure blob CSV file to a pandas DataFrame while running a notebook in Azure ML

闹比i 2020-12-10 14:56

I have a number of large CSV (tab-delimited) files stored as Azure blobs, and I want to create a pandas dataframe from them. I can do this locally as follows:



        
4 Answers
  • 2020-12-10 15:18

    I think you want to use get_blob_to_bytes or get_blob_to_text; these should give you the blob's contents, which you can wrap in StringIO to create a dataframe:

    from io import StringIO
    blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
    df = pd.read_csv(StringIO(blobstring))
    
  • 2020-12-10 15:19

    Simple Answer:

    Working as of 20 Sep 2020.

    Below are the steps to read a CSV file from Azure Blob storage into a dataframe in a Jupyter notebook (Python).

    STEP 1: Generate a SAS token and URL for the target CSV blob by right-clicking the blob in Azure Storage.

    STEP 2: Copy the Blob SAS URL that appears below the button used to generate the SAS token and URL.

    STEP 3: Use the code below in your Jupyter notebook to import the CSV. Replace the url value with the Blob SAS URL you copied in the previous step.

    import pandas as pd 
    url ='Your Blob SAS URL'
    df = pd.read_csv(url)
    df.head()
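
Since the question mentions large, tab-delimited files, here is a minimal sketch of how the same `pd.read_csv` call could be adapted. The helper name `read_tab_delimited` and its `keep` parameter are hypothetical, introduced only for illustration; `sep` and `chunksize` are standard `pd.read_csv` arguments. A `StringIO` object stands in for the Blob SAS URL so the sketch runs offline.

```python
from io import StringIO

import pandas as pd


def read_tab_delimited(source, chunksize=None, keep=None):
    """Read a tab-delimited file from a URL, path, or file-like object.

    `keep` is an optional function applied to each chunk (or to the whole
    frame when `chunksize` is None), e.g. to drop rows you do not need.
    """
    if chunksize is None:
        df = pd.read_csv(source, sep="\t")
        return keep(df) if keep else df

    parts = []
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        parts.append(keep(chunk) if keep else chunk)
    return pd.concat(parts, ignore_index=True)


# Stand-in for 'Your Blob SAS URL' (assumption: real data would come from the URL).
sample = StringIO("id\tname\n1\talpha\n2\tbeta\n3\tgamma\n")
df = read_tab_delimited(sample, chunksize=2, keep=lambda c: c[c["id"] > 1])
print(df)
```

Chunking only lowers peak memory if each chunk is filtered or aggregated before it is kept; concatenating unfiltered chunks ends up with the full frame in memory anyway.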
    
  • 2020-12-10 15:23

    Thanks for the answer, but I think a correction is needed: you need to read the content attribute of the returned blob object, and get_blob_to_text does not take a local file name.

    from io import StringIO
    blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
    df = pd.read_csv(StringIO(blobstring))
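
The question involves several tab-delimited blobs, so the corrected call could be wrapped in a loop along these lines. This is only a sketch: `blobs_to_dataframe` is a hypothetical helper, `blob_service` is assumed to be a legacy-SDK client as in the snippet above, and it assumes every blob has the same columns.

```python
from io import StringIO

import pandas as pd


def blobs_to_dataframe(blob_service, container_name, blob_names, sep="\t"):
    """Download each blob as text and combine them into one dataframe.

    Assumes `blob_service.get_blob_to_text` returns an object whose
    `.content` attribute holds the blob text, as in the legacy SDK.
    """
    frames = []
    for name in blob_names:
        blob = blob_service.get_blob_to_text(container_name, name)
        frames.append(pd.read_csv(StringIO(blob.content), sep=sep))
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

It would be called as `df = blobs_to_dataframe(blob_service, CONTAINERNAME, [BLOBNAME, ...])`; pass `sep=","` for comma-separated files.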
    
  • 2020-12-10 15:35

    The accepted answer will not work with the latest Azure Storage SDK. Microsoft has rewritten the SDK completely, which is annoying if you upgrade from the old version. The code below should work in the new version.

    from azure.storage.blob import ContainerClient
    from io import StringIO
    import pandas as pd
    
    conn_str = ""
    container = ""
    blob_name = ""
    
    container_client = ContainerClient.from_connection_string(
        conn_str=conn_str, 
        container_name=container
        )
    # Download blob as StorageStreamDownloader object (stored in memory)
    downloaded_blob = container_client.download_blob(blob_name)
    
    df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
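
As a variation on the code above, here is a hedged sketch that reads the download as raw bytes with `readall()` and lets pandas do the decoding, with the tab separator from the question. `download_blob_to_dataframe` is a hypothetical helper; it assumes `container_client` is a `ContainerClient` like the one created above, and that `readall()` returns bytes, which is the default when no encoding is passed to `download_blob`.

```python
from io import BytesIO

import pandas as pd


def download_blob_to_dataframe(container_client, blob_name, sep="\t"):
    """Download one blob into memory and parse it as delimited text."""
    downloader = container_client.download_blob(blob_name)
    data = downloader.readall()  # whole blob as bytes, held in memory
    # pandas decodes the bytes itself (UTF-8 unless `encoding` is given).
    return pd.read_csv(BytesIO(data), sep=sep)
```

Usage would be `df = download_blob_to_dataframe(container_client, blob_name)`. The whole blob is still held in memory, so very large files may need a different approach.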
