PySpark using IAM roles to access S3

野趣味 2021-02-08 23:48

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket.

5 Answers
  •  误落风尘
    2021-02-09 00:02

    IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:

    • Compatible versions of aws-java-sdk and hadoop-aws. This pairing is quite brittle, and only specific combinations work.
    • You must use S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.

    This is what worked for me:

    import os
    import pyspark
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    
    # Pin the matching jar versions before the JVM starts.
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
    
    sc = SparkContext.getOrCreate()
    
    # Route s3a:// URLs through S3AFileSystem rather than NativeS3FileSystem.
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    
    spark = SparkSession(sc)
    
    df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)
    df.show()
    

    It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that magically made it work. There is good guidance on troubleshooting s3a access in Hadoop's S3A troubleshooting documentation.

    In particular, note that:

    Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.

    Tip: you can use mvnrepository to determine the dependency version requirements of a specific hadoop-aws JAR published by the ASF.
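
    As a sketch of that tip in practice, you can infer which Hadoop version your PySpark install bundles from the jar names under its install directory, and then pick the hadoop-aws artifact with the same version. The helper name and path layout below are my own illustration, not from the post:

```python
import glob
import os
import re

def bundled_hadoop_version(pyspark_home):
    """Infer the bundled Hadoop version from the hadoop-* jars PySpark ships."""
    pattern = os.path.join(pyspark_home, "jars", "hadoop-*.jar")
    for jar in sorted(glob.glob(pattern)):
        # Jar names end in "-<major>.<minor>.<patch>.jar".
        m = re.search(r"-(\d+\.\d+\.\d+)\.jar$", os.path.basename(jar))
        if m:
            return m.group(1)
    return None

# The hadoop-aws coordinate passed to --packages should then carry the same
# version, e.g. org.apache.hadoop:hadoop-aws:<that version>.
```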

    Here is a useful post containing further information.
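
    Note that the config above still relies on whatever credentials the environment supplies; for the questioner's actual constraint (assuming a role), one option is to assume it yourself with STS and hand the temporary credentials to S3A. A minimal sketch, assuming hadoop-aws 2.8+ (which added session-token support) and a placeholder role ARN; the helper function is my own, not from the post:

```python
def s3a_conf_from_sts_credentials(creds):
    """Map an STS `Credentials` dict onto the fs.s3a settings S3A expects."""
    return {
        # TemporaryAWSCredentialsProvider reads key + secret + session token.
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": creds["AccessKeyId"],
        "fs.s3a.secret.key": creds["SecretAccessKey"],
        "fs.s3a.session.token": creds["SessionToken"],
    }

# Usage (requires boto3 and real AWS credentials; ARN is a placeholder):
# import boto3
# resp = boto3.client("sts").assume_role(
#     RoleArn="arn:aws:iam::123456789012:role/my-bucket-reader",
#     RoleSessionName="pyspark")
# for k, v in s3a_conf_from_sts_credentials(resp["Credentials"]).items():
#     hadoopConf.set(k, v)
```

    The temporary credentials expire, so long-running jobs would need to refresh them and reapply the config.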

    Here's some more useful information about compatibility between the Java libraries.

    I was trying to get this to work in the Jupyter pyspark notebook. Note that the hadoop-aws version had to match the Hadoop install in the notebook image's Dockerfile.
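
    On newer Hadoop builds (3.1+), S3A can also assume the role itself via its built-in AssumedRoleCredentialProvider, with no STS round-trip in your own code. A sketch, assuming that Hadoop version; the role ARN is a placeholder and the helper name is my own:

```python
def assumed_role_s3a_conf(role_arn):
    """fs.s3a settings telling S3A (Hadoop 3.1+) to assume `role_arn` itself."""
    return {
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        "fs.s3a.assumed.role.arn": role_arn,
    }

conf = assumed_role_s3a_conf("arn:aws:iam::123456789012:role/my-bucket-reader")
# Applied to the SparkContext's Hadoop configuration as in the snippet above:
# for k, v in conf.items():
#     hadoopConf.set(k, v)
```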
