How to Create Dataframe from AWS Athena using Boto3 get_query_results method

广开言路 2021-02-13 05:37

I'm using AWS Athena to query raw data from S3. Since Athena writes the query output to an S3 output bucket, I used to do:

df = pd.read_csv(OutputLocation)
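
For context, a minimal sketch of where `OutputLocation` can come from: Athena records the S3 path of the result file on the query execution itself. The helper name `get_output_location` is hypothetical, `client` is assumed to be a `boto3.client('athena')`, and reading an `s3://` path directly with `pd.read_csv` assumes the optional `s3fs` package is installed.

```python
def get_output_location(client, query_execution_id):
    """Return the S3 path of the CSV that Athena wrote for a finished query.

    `client` is assumed to be a boto3 Athena client; the helper name is
    hypothetical and only used for illustration.
    """
    response = client.get_query_execution(QueryExecutionId=query_execution_id)
    return response['QueryExecution']['ResultConfiguration']['OutputLocation']
```

With a real client this would be used roughly as `df = pd.read_csv(get_output_location(client, query_id))`.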
         


        
5 Answers
  •  别跟我提以往
    2021-02-13 06:35

    I have a solution for my first question, using the following function:

    def results_to_df(results):
        # Column names come from the result set metadata.
        columns = [
            col['Label']
            for col in results['ResultSet']['ResultSetMetadata']['ColumnInfo']
        ]

        listed_results = []
        # Skip the first row: for SELECT queries it repeats the column names.
        for res in results['ResultSet']['Rows'][1:]:
            values = []
            for field in res['Data']:
                # A NULL cell comes back as an empty dict, so fall back to None.
                values.append(field.get('VarCharValue'))

            listed_results.append(dict(zip(columns, values)))

        return listed_results
    

    and then:

    t = results_to_df(response)
    pd.DataFrame(t)
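
To make the input concrete, here is a small hand-written sample in the shape `get_query_results` returns for a `SELECT`, run through the same column/row logic. The sample values and the helper name `rows_from_results` are made up for illustration; note that the first row repeats the column names and that a NULL cell arrives as an empty dict.

```python
# Hand-written sample shaped like an Athena get_query_results response.
sample_response = {
    'ResultSet': {
        'Rows': [
            {'Data': [{'VarCharValue': 'id'}, {'VarCharValue': 'name'}]},  # header row
            {'Data': [{'VarCharValue': '1'}, {'VarCharValue': 'alice'}]},
            {'Data': [{'VarCharValue': '2'}, {}]},  # NULL name
        ],
        'ResultSetMetadata': {
            'ColumnInfo': [
                {'Name': 'id', 'Label': 'id', 'Type': 'integer'},
                {'Name': 'name', 'Label': 'name', 'Type': 'varchar'},
            ]
        },
    }
}


def rows_from_results(results):
    """Turn one response into a list of {column: value} dicts, skipping the header."""
    columns = [c['Label'] for c in results['ResultSet']['ResultSetMetadata']['ColumnInfo']]
    return [
        dict(zip(columns, (cell.get('VarCharValue') for cell in row['Data'])))
        for row in results['ResultSet']['Rows'][1:]
    ]


rows = rows_from_results(sample_response)
print(rows)
# [{'id': '1', 'name': 'alice'}, {'id': '2', 'name': None}]
```

Every value comes back as a string, so numeric columns still need converting after `pd.DataFrame(rows)`, for example with `astype`.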
    

    As for my second question, and at the request of @EricBellet, I'm also adding my approach to pagination, which I find inefficient and slower compared to loading the results from the Athena output in S3:

    def run_query(query, database, s3_output):
        '''
        Execute an Athena query and return the response containing the query execution ID.
        '''
        client = boto3.client('athena')
        response = client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={
                'Database': database
                },
            ResultConfiguration={
                'OutputLocation': s3_output,
                }
            )
        print('Execution ID: ' + response['QueryExecutionId'])
        return response
    
    
    
    def format_result(results):
        '''
        Convert one page of get_query_results output into a list of row dicts.
        '''
        columns = [
            col['Label']
            for col in results['ResultSet']['ResultSetMetadata']['ColumnInfo']
        ]

        formatted_results = []

        for result in results['ResultSet']['Rows']:
            # A NULL cell comes back as an empty dict, so fall back to None.
            values = [field.get('VarCharValue') for field in result['Data']]
            formatted_results.append(dict(zip(columns, values)))
        return formatted_results
    
    
    
    res = run_query(query_2, database, s3_output)  # query Athena
    
    
    
    import time

    import boto3
    import pandas as pd

    # Assumes the query started above has already finished.
    client = boto3.client('athena')
    query_id = res['QueryExecutionId']
    start_time = time.time()

    # The paginator follows NextToken itself, so no manual marker loop is needed.
    paginator = client.get_paginator('get_query_results')
    response_iterator = paginator.paginate(
        QueryExecutionId=query_id,
        PaginationConfig={'PageSize': 1000})

    frames = []
    for i, page in enumerate(response_iterator):
        format_page = format_result(page)
        if i == 0:
            # For SELECT queries the first row of the first page repeats the column names.
            format_page = format_page[1:]
        frames.append(pd.DataFrame(format_page))

    # DataFrame.append was removed in pandas 2.0; concatenate all pages once instead.
    formatted_results = pd.concat(frames, ignore_index=True)

    print("My program took", time.time() - start_time, "to run")
    

    It's not formatted very well, but I think it does the job...
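
One thing the snippets above leave out is waiting for the query to finish: `get_query_results` typically fails while the query is still queued or running. A minimal polling sketch, assuming `client` is a `boto3.client('athena')`; the helper name `wait_for_query` and the `poll_seconds` and `timeout_seconds` parameters are hypothetical:

```python
import time


def wait_for_query(client, query_id, poll_seconds=1.0, timeout_seconds=300.0):
    """Poll get_query_execution until the query reaches a terminal state.

    Returns the final state string; raises on failure, cancellation or timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution['QueryExecution']['Status']
        state = status['State']
        if state == 'SUCCEEDED':
            return state
        if state in ('FAILED', 'CANCELLED'):
            reason = status.get('StateChangeReason', 'no reason given')
            raise RuntimeError(f'Query {query_id} ended as {state}: {reason}')
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Query {query_id} still {state} after {timeout_seconds}s')
        # Query is still QUEUED or RUNNING; wait before asking again.
        time.sleep(poll_seconds)
```

It would be called between `run_query` and the paginator, for example `wait_for_query(client, res['QueryExecutionId'])`.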
