I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket.
IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:

- aws-java-sdk and hadoop-aws. This is quite brittle, so only specific combinations work.
- S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.

This is what worked for me:
import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Must be set before the SparkContext is created so the JARs end up on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'

sc = SparkContext.getOrCreate()

# Tell Hadoop to use S3AFileSystem for s3a:// paths.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

spark = SparkSession(sc)
df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)
df.show()
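If you additionally need to assume a specific role (as in the question), one approach is to fetch temporary credentials with boto3/STS and hand them to S3A via TemporaryAWSCredentialsProvider. This is only a sketch: it assumes a newer hadoop-aws (roughly 2.8+, which adds session-token support) rather than the 2.7.1 pinned above, and the role ARN and session name are placeholders.

import boto3
from pyspark.sql import SparkSession

# Assume the role with STS; the ARN and session name below are placeholders.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-bucket-access-role",
    RoleSessionName="pyspark-s3a",
)["Credentials"]

spark = SparkSession.builder.getOrCreate()
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()

# Hand the temporary credentials (access key, secret, session token) to S3A.
hadoopConf.set("fs.s3a.aws.credentials.provider",
               "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", creds["AccessKeyId"])
hadoopConf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
hadoopConf.set("fs.s3a.session.token", creds["SessionToken"])

df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)

Note that the temporary credentials expire, so a long-running job would need to refresh them; hadoop-aws 3.1+ also ships an org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider that can do the assume-and-renew for you.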
It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that magically made it work. There is good guidance on troubleshooting s3a access here. Especially note that:
Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
Tip: you can use mvnrepository to determine the dependency version requirements of a specific hadoop-aws JAR published by the ASF.
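As an aside, the same pinned pair of packages can also be supplied through Spark's spark.jars.packages config instead of the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch (the versions are the ones that worked above and still have to match your Hadoop install):

from pyspark.sql import SparkSession

# Pull the pinned, mutually compatible JARs at session start-up.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1")
         .config("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .getOrCreate())

df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)

The spark.hadoop. prefix is just Spark's way of forwarding fs.s3a.* settings into the Hadoop configuration at start-up.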
Here is a useful post containing further information.
Here's some more useful information about compatibility between the Java libraries.
I was trying to get this to work in the jupyter pyspark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.
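If you're not sure which Hadoop version your notebook image ships (and therefore which hadoop-aws to pull in), you can ask the JVM from PySpark; a small sketch, assuming a SparkSession is already available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop's VersionInfo reports the version Spark is running against;
# the hadoop-aws artifact you add should carry the same version number.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)  # e.g. 2.7.1 -> use org.apache.hadoop:hadoop-aws:2.7.1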