I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket.
IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:

- aws-java-sdk and hadoop-aws. This is quite brittle, so only specific combinations work.
- S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.

This is what worked for me:
import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Must be set before the SparkContext is created so the JARs end up on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'

sc = SparkContext.getOrCreate()

# Tell Hadoop to use S3AFileSystem for s3a:// paths.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

spark = SparkSession(sc)
df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)
df.show()
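If you additionally need to assume a specific role (as in the question), one approach is to fetch temporary credentials with boto3/STS and hand them to S3A via TemporaryAWSCredentialsProvider. This is only a sketch: it assumes a newer hadoop-aws (roughly 2.8+, which adds session-token support) rather than the 2.7.1 pinned above, and the role ARN and session name are placeholders.

import boto3
from pyspark.sql import SparkSession

# Assume the role with STS; the ARN and session name below are placeholders.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-bucket-access-role",
    RoleSessionName="pyspark-s3a",
)["Credentials"]

spark = SparkSession.builder.getOrCreate()
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()

# Hand the temporary credentials (access key, secret, session token) to S3A.
hadoopConf.set("fs.s3a.aws.credentials.provider",
               "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", creds["AccessKeyId"])
hadoopConf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
hadoopConf.set("fs.s3a.session.token", creds["SessionToken"])

df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)

Note that the temporary credentials expire, so a long-running job would need to refresh them; hadoop-aws 3.1+ also ships an org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider that can do the assume-and-renew for you.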
It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that magically made it work. There is good guidance on troubleshooting s3a access here. Especially note that:
Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
Tip: you can use mvnrepository to determine the dependency version requirements of a specific hadoop-aws JAR published by the ASF.
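As an aside, the same pinned pair of packages can also be supplied through Spark's spark.jars.packages config instead of the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch (the versions are the ones that worked above and still have to match your Hadoop install):

from pyspark.sql import SparkSession

# Pull the pinned, mutually compatible JARs at session start-up.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1")
         .config("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .getOrCreate())

df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)

The spark.hadoop. prefix is just Spark's way of forwarding fs.s3a.* settings into the Hadoop configuration at start-up.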
Here is a useful post containing further information.
Here's some more useful information about compatibility between the Java libraries.
I was trying to get this to work in the jupyter pyspark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.
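If you're not sure which Hadoop version your notebook image ships (and therefore which hadoop-aws to pull in), you can ask the JVM from PySpark; a small sketch, assuming a SparkSession is already available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop's VersionInfo reports the version Spark is running against;
# the hadoop-aws artifact you add should carry the same version number.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)  # e.g. 2.7.1 -> use org.apache.hadoop:hadoop-aws:2.7.1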