Question
I've been using the metis package to run Athena queries via R. While this is great for small queries, there still does not seem to be a viable solution for queries with very large return datasets (tens of thousands of rows, for example). However, when running these same queries in the AWS console, it is fast and straightforward to use the download link to obtain the CSV file of the query result.
This got me thinking: is there a mechanism for sending the query via R but returning/obtaining the S3:// bucket location where the query results live, instead of the normal results object?
Answer 1:
As mentioned in my comment above, you could investigate the RAthena and noctua packages.
These packages connect to AWS Athena using the AWS SDKs as their drivers. This means that they also download the data from S3, in a similar way to the method mentioned by @Zerodf. They both use data.table to load the data into R, so they are pretty quick. You can also get the Query Execution ID if it is required for some reason.
Here is an example of how to use the packages:
RAthena
Create a connection to AWS Athena; for more information on how to connect, please look at: dbConnect
library(DBI)
con <- dbConnect(RAthena::athena())
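If your credentials and staging bucket are not picked up from the environment automatically, they can also be passed to dbConnect explicitly. A minimal sketch, assuming RAthena's s3_staging_dir and region_name arguments and placeholder bucket/region values:
library(DBI)
# Sketch only: the bucket and region below are placeholders, not values from the question
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://SOMEBUCKET/",
                 region_name = "us-east-1")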
Example of how to query Athena:
dbGetQuery(con, "select * from sampledb.elb_logs")
How to access the Query ID:
res <- dbSendQuery(con, "select * from sampledb.elb_logs")
sprintf("%s%s.csv",res@connection@info$s3_staging, res@info$QueryExecutionId)
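If what you really want (as in the question) is the S3 location rather than the rows, that CSV path can then be downloaded directly, for example with aws.s3 from the Cloudyr project mentioned in the answer below. A rough sketch, assuming aws.s3 is installed, your credentials are configured, and that save_object accepts a full s3:// URI for its object argument; result.csv is just a placeholder local file name:
library(aws.s3)
res <- dbSendQuery(con, "select * from sampledb.elb_logs")
# Full s3:// path to the CSV Athena wrote for this query
csv_uri <- sprintf("%s%s.csv", res@connection@info$s3_staging, res@info$QueryExecutionId)
# Pull the result file straight from the staging bucket instead of fetching rows through the driver
aws.s3::save_object(object = csv_uri, file = "result.csv")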
noctua
Create a connection to AWS Athena; for more information on how to connect, please look at: dbConnect
library(DBI)
con <- dbConnect(noctua::athena())
Example of how to query Athena:
dbGetQuery(con, "select * from sampledb.elb_logs")
How to access the Query ID:
res <- dbSendQuery(con, "select * from sampledb.elb_logs")
sprintf("%s%s.csv",res@connection@info$s3_staging, res@info$QueryExecutionId)
Sum up
These packages should do what you are looking for; however, as they download the data from the query output in S3, I don't believe you will need to go to the query execution ID to do the same process.
Answer 2:
You could look at the Cloudyr Project. They have a package that handles creating the signature requests for the AWS API. Then you can fire off a query, poll AWS until the query finishes (using the QueryExecutionID), and use aws.s3 to download the result set.
You can also use system() to run AWS CLI commands that execute a query, wait for the results, and download the result file.
For example: You could run the following commands on the command line to get the results of a query.
$ aws athena start-query-execution --query-string "select count(*) from test_null_unquoted" --query-execution-context Database=stackoverflow --result-configuration OutputLocation=s3://SOMEBUCKET/ --output text
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
Once you get the query-execution-id, you can check on the results.
$ aws athena get-query-execution --query-execution-id=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX --output text
QUERYEXECUTION select count(*) from test_null_unquoted XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
QUERYEXECUTIONCONTEXT stackoverflow
RESULTCONFIGURATION s3://SOMEBUCKET/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.csv
STATISTICS 104 1403
STATUS 1528809056.658 SUCCEEDED 1528809054.945
Once the query succeeds, you can download the data.
$ aws s3 cp s3://stack-exchange/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.csv .
Edit: You can even turn those commands into a one-liner (Bash example here), but I'm sure you could do the same thing in PowerShell.
$ eid=`aws athena start-query-execution --query-string "select count(*) from test_null_unquoted" --query-execution-context Database=SOMEDATABASE --result-configuration OutputLocation=s3://SOMEBUCKET/ --output text` && until aws athena get-query-execution --query-execution-id=$eid --output text | grep "SUCCEEDED"; do echo "waiting..."; sleep 10; done && aws s3 cp s3://SOMEBUCKET/$eid.csv . && unset eid
Source: https://stackoverflow.com/questions/50806014/can-you-use-athena-odbc-jdbc-to-return-the-s3-location-of-results