How to Create Dataframe from AWS Athena using Boto3 get_query_results method

广开言路 2021-02-13 05:37

I'm using AWS Athena to query raw data from S3. Since Athena writes the query output to an S3 output bucket, I used to do:

df = pd.read_csv(OutputLocation)
         


        
5 Answers
  •  鱼传尺愫
    2021-02-13 06:23

    A simple solution is a list comprehension over the rows returned by the boto3 Athena get_query_results call (or over each page from its paginator). The resulting list of lists can be passed straight to pd.DataFrame():

    pd.DataFrame([[data.get('VarCharValue') for data in row['Data']] for row in
                  results['ResultSet']['Rows']])
    

    Boto3 Athena to Pandas DataFrame

    import pandas as pd
    import boto3
    
    result = get_query_results( . . . ) # your code here
    
    def cleanQueryResult(result) :
        '''
        Convert the raw dictionary returned by the boto3 Athena client's
        get_query_results into a 2D list for further processing.
    
        Parameters
        ----------
        result : dict
            The dictionary returned by the boto3 Athena client method get_query_results.
    
        Returns
        -------
        list of list
            2D list holding the table result. The first row contains the column names.
        '''
        return [[data.get('VarCharValue') for data in row['Data']]
                for row in result['ResultSet']['Rows']]
    
    # the first row holds the column names, so use it as the header
    rows = cleanQueryResult(result)
    df = pd.DataFrame(rows[1:], columns=rows[0])
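To make the shape of the data concrete, here is a small sketch that runs without AWS access. The `sample_result` dictionary is hand-built to mimic a get_query_results response, and its values are made up for illustration. The helper name `rows_to_dataframe` is hypothetical. The sketch assumes a NULL cell arrives as an empty dictionary, which is why `.get('VarCharValue')` is used instead of indexing.

```python
import pandas as pd

# Hand-built stand-in for a get_query_results response (made-up values).
sample_result = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "id"}, {"VarCharValue": "name"}]},
            {"Data": [{"VarCharValue": "1"}, {"VarCharValue": "alice"}]},
            {"Data": [{"VarCharValue": "2"}, {}]},  # assumed shape of a NULL cell
        ]
    }
}


def rows_to_dataframe(result):
    """Turn one get_query_results-style dict into a DataFrame with a header."""
    rows = [[cell.get("VarCharValue") for cell in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
    # The first row carries the column names; the rest is data.
    header, *body = rows
    return pd.DataFrame(body, columns=header)


df = rows_to_dataframe(sample_result)

# Every value comes back as a string, so numeric columns need an explicit cast.
df["id"] = pd.to_numeric(df["id"])
```

Because all cells are returned as strings, casting columns after building the DataFrame is usually needed before doing arithmetic on them.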
    
    

    Millions of Results

    This requires the paginator object: https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/athena.html#paginators

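    As a hint, here's how you can append the rows of each page to an existing DataFrame. DataFrame.append was removed in pandas 2.0, so use pd.concat and keep the returned result:

    df = pd.concat([df, pd.DataFrame(cleanQueryResult(next_page), columns=df.columns)],
                   ignore_index=True)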
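The paging hint above can be sketched end to end as follows. This is a minimal sketch, not a definitive implementation: the function names `pages_to_dataframe` and `fetch_dataframe` are hypothetical, and it assumes the header row appears only once, as the first row of the first page, which is the usual case for a plain SELECT query. It also assumes AWS credentials and a region are already configured for boto3.

```python
import pandas as pd


def pages_to_dataframe(pages):
    """Build one DataFrame from an iterable of get_query_results pages."""
    header = None
    body = []
    for page in pages:
        rows = [[cell.get("VarCharValue") for cell in row["Data"]]
                for row in page["ResultSet"]["Rows"]]
        # Assumption: only the very first row of the first page is the header.
        if header is None and rows:
            header, rows = rows[0], rows[1:]
        body.extend(rows)
    return pd.DataFrame(body, columns=header)


def fetch_dataframe(query_execution_id):
    """Page through a finished Athena query and return its rows as a DataFrame."""
    import boto3  # imported here so pages_to_dataframe works without AWS access

    client = boto3.client("athena")
    paginator = client.get_paginator("get_query_results")
    pages = paginator.paginate(QueryExecutionId=query_execution_id)
    return pages_to_dataframe(pages)
```

Collecting the rows in a plain list and building the DataFrame once avoids copying the whole frame on every page, which is what a pd.concat call per page would do.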
    
