Question
I want to access S3 from Spark without configuring any secret and access keys; I want to access it by configuring an IAM role, so I followed the steps given in s3-spark.
But it is still not working from my EC2 instance (which is running standalone Spark).
It works when I test with the AWS CLI:
[ec2-user@ip-172-31-17-146 bin]$ aws s3 ls s3://testmys3/
2019-01-16 17:32:38 130 e.json
but it did not work when I tried the following:
scala> val df = spark.read.json("s3a://testmys3/*")
I am getting the error below:
19/01/16 18:23:06 WARN FileStreamSink: Error while looking for metadata directory.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: E295957C21AFAC37, AWS Error Code: null, AWS Error Message: Bad Request
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
Answer 1:
"400 Bad Request" is fairly unhelpful, and not only does S3 not provide much, the S3A connector doesn't date print much related to auth either. There's a big section on troubleshooting the error
The fact it got as far as making a request means that it has some credentials, only the far end doesn't like them
Possibilities
- Your IAM role doesn't have the s3:ListBucket permission. See "IAM role permissions for working with s3a".
- Your bucket name is wrong.
- There are some settings in fs.s3a or the AWS_ environment variables which take priority over the IAM role, and they are wrong.
You should automatically have IAM auth as an authentication mechanism with the S3A connector; it's the one which is checked last, after config and env vars.
- Have a look at what is set in fs.s3a.aws.credentials.provider: it must be unset or contain the option com.amazonaws.auth.InstanceProfileCredentialsProvider (see the sketch after this list).
- Assuming you also have hadoop on the command line, grab storediag:
hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag s3a://testmys3/
it should dump what it is up to regarding authentication.
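If you'd rather check the credentials-provider setting from inside spark-shell than from config files, here is a minimal sketch, assuming the usual spark session created by spark-shell and hadoop-aws on the classpath:
// Inspect the S3A credentials-provider setting from spark-shell.
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Prints null if unset, which means the default provider chain
// (config, env vars, then instance profile / IAM role) is in effect.
println(hadoopConf.get("fs.s3a.aws.credentials.provider"))

// To force IAM-role-only auth, point it at the instance profile provider.
// (Only needed if something else has overridden the default chain.)
hadoopConf.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")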
Update
As the original poster has commented, it was due to V4 authentication being required by the specific S3 endpoint. This can be enabled on the 2.7.x version of the s3a client, but only via Java system properties. For 2.8+ there are fs.s3a. options you can set instead.
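For the 2.7.x route, a rough sketch of doing this from inside spark-shell (instead of via --conf flags) could look like the lines below; note this only sets the property on the driver JVM, so on a real cluster you still need spark.executor.extraJavaOptions as in the next answer:
// Hadoop 2.7.x s3a: V4 signing can only be switched on via a JVM
// system property, and it must be set before the S3A filesystem is
// first created for the bucket.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")

// Point s3a at the region-specific endpoint (us-east-2 here, as in
// the question); V4-only regions require the regional endpoint.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")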
Answer 2:
This config worked:
./spark-shell \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
--conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
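With a shell started like that, the original read should go through, assuming the instance's IAM role grants read access (s3:GetObject and s3:ListBucket) on the bucket:
scala> val df = spark.read.json("s3a://testmys3/e.json")
scala> df.show()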
Source: https://stackoverflow.com/questions/54223242/aws-access-s3-from-spark-using-iam-role