I'm using AWS Athena to query raw data from S3. Since Athena writes the query output into an S3 output bucket, I used to do:
df = pd.read_csv(OutputLocation)
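For reference, OutputLocation comes back from the Athena API roughly like this (the bucket, table, and query below are placeholders for my own values, and reading s3:// paths with pandas needs s3fs installed):

import boto3
import pandas as pd

client = boto3.client('athena')
execution = client.start_query_execution(
    QueryString='SELECT * FROM my_table LIMIT 10',  # placeholder query
    ResultConfiguration={'OutputLocation': 's3://my-output-bucket/athena/'},  # placeholder bucket
)
# ... poll get_query_execution until the state is SUCCEEDED ...
OutputLocation = client.get_query_execution(
    QueryExecutionId=execution['QueryExecutionId']
)['QueryExecution']['ResultConfiguration']['OutputLocation']
df = pd.read_csv(OutputLocation)  # needs the s3fs package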
A very simple solution is to use a list comprehension over the boto3 Athena results, combined with the Athena paginator for large result sets. The list comprehension can then be passed straight into pd.DataFrame() to create a DataFrame, as such:
pd.DataFrame([[data.get('VarCharValue') for data in row['Data']]
              for row in results['ResultSet']['Rows']])
Here is the full version:

import pandas as pd
import boto3

client = boto3.client('athena')
result = client.get_query_results(QueryExecutionId=query_execution_id)  # your QueryExecutionId here

def cleanQueryResult(result):
    '''
    Take the dictionary of raw boto3 Athena results and turn it into a
    2D list for further processing.

    Parameters
    ----------
    result : dict
        The dictionary returned by the boto3 Athena client's
        get_query_results call.

    Returns
    -------
    list of lists
        2D list which is essentially the table result. The first row holds
        the column names.
    '''
    return [[data.get('VarCharValue') for data in row['Data']]
            for row in result['ResultSet']['Rows']]

# note that the first row is the header
df = pd.DataFrame(cleanQueryResult(result))
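Since the first row holds the column names, you can promote it to the real header, roughly like this:

# promote the header row to column names and drop it from the data
df.columns = df.iloc[0]
df = df.drop(0).reset_index(drop=True)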
For large result sets, this requires the paginator object: https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/athena.html#paginators
As a hint, here's how you can append the rows from each additional page (DataFrame.append is deprecated in recent pandas, so pd.concat is the safer choice, and ignore_index belongs to the concatenation call, not to pd.DataFrame):

df = pd.concat([df, pd.DataFrame(cleanQueryResult(next_page))], ignore_index=True)
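Putting it together, a minimal sketch of the paginated version could look like this; query_execution_id is assumed to come from your own start_query_execution call, and Athena returns the header row only on the first page:

import boto3
import pandas as pd

client = boto3.client('athena')
paginator = client.get_paginator('get_query_results')

# collect every page of the result set into a single DataFrame
frames = [pd.DataFrame(cleanQueryResult(page))
          for page in paginator.paginate(QueryExecutionId=query_execution_id)]
df = pd.concat(frames, ignore_index=True)

# the header sits in the first row, so promote it as shown above
df.columns = df.iloc[0]
df = df.drop(0).reset_index(drop=True)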