PySpark using IAM roles to access S3

野趣味 2021-02-08 23:48

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. Th

5 answers
  • 2021-02-08 23:51

    After more research, I'm convinced this is not yet supported as evidenced here.

    Others have suggested a more manual approach (see this blog post): list the S3 keys using boto, then parallelize that list using Spark and read each object on the workers.

    The problem here (and I don't yet see how they themselves get around it) is that the S3 objects returned by a bucket listing are not serializable/picklable. Remember, the suggestion is to hand these objects to the workers, which read them in independent processes via map or flatMap. To make matters worse, the boto S3 client itself isn't serializable (which is reasonable, in my opinion).

    That seems to leave only one option: recreating the assumed-role S3 client for every file, which isn't optimal, or even feasible past a certain point.

    If anyone sees any flaws in this reasoning or an alternative solution/approach, I'd love to hear it.
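
    One possible way to soften the per-file cost, sketched here under assumptions rather than taken from the blog post: ship only the key strings (which do pickle) and build the assumed-role client once per partition inside mapPartitions. The names make_assumed_role_client and read_partition, the session name, and the role ARN are hypothetical; boto3 is assumed to be installed on every worker.

```python
def make_assumed_role_client(role_arn, session_name="spark-s3-read"):
    """Build an S3 client from temporary assumed-role credentials.

    Hypothetical helper: role_arn and session_name are placeholders.
    """
    import boto3  # imported lazily so the import happens on the worker

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def read_partition(keys, bucket, client_factory):
    """Read every key of one partition using a single client."""
    client = client_factory()  # created once per partition, not per file
    for key in keys:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield key, body


# Usage sketch (assumes an existing SparkContext `sc` and a list of key strings):
# rdd = sc.parallelize(keys).mapPartitions(
#     lambda ks: read_partition(
#         ks, "mybucket", lambda: make_assumed_role_client(ROLE_ARN)
#     )
# )
```

    Only plain strings cross the driver/worker boundary here, so nothing unpicklable is serialized; the number of STS calls then scales with the number of partitions rather than the number of files.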

  • 2021-02-09 00:02

    IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:

    • Compatible versions of aws-java-sdk and hadoop-aws. This is quite brittle, so only specific combinations work.
    • S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.

    This is what worked for me:

    import os
    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    # Must be set before the SparkContext (and its JVM) is created,
    # so that the matching JARs are fetched at startup.
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'

    sc = SparkContext.getOrCreate()

    # Route s3a:// URLs through the S3A connector.
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    spark = SparkSession(sc)

    df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)
    df.show()
    

    It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that magically made it work. There is good guidance on troubleshooting s3a access here.

    Note especially:

    Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.

    Tip: you can use mvnrepository to determine the dependency version requirements of a specific hadoop-aws JAR published by the ASF.

    Here is a useful post containing further information.

    Here's some more useful information about compatibility between the Java libraries.

    I was trying to get this to work in the jupyter pyspark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.
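
    To keep the two versions in one place, the submit arguments can be assembled from variables. This is only a sketch: submit_args is a hypothetical helper, and the aws-java-sdk version still has to be looked up by hand (e.g. on mvnrepository) for the hadoop-aws release in use.

```python
import os


def submit_args(hadoop_version, aws_sdk_version):
    """Build PYSPARK_SUBMIT_ARGS for one hadoop-aws / aws-java-sdk pair.

    hadoop_version should match the Hadoop build that Spark runs on;
    aws_sdk_version is whatever that hadoop-aws release declares as its
    dependency. Neither is validated here.
    """
    packages = ",".join([
        f"com.amazonaws:aws-java-sdk:{aws_sdk_version}",
        f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
    ])
    return f"--packages {packages} pyspark-shell"


# Has to happen before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args("2.7.1", "1.7.4")
```

    On a running context, sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion() should report the Hadoop version that hadoop-aws needs to match.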

  • 2021-02-09 00:06

    Hadoop 2.8+'s s3a connector supports IAM roles via a new credential provider; it's not in the Hadoop 2.7 release.

    To use it you need to change the credential provider.

    fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
    fs.s3a.access.key = <your access key>
    fs.s3a.secret.key = <session secret>
    fs.s3a.session.token = <session token>
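
    From PySpark, one way to pass these settings is through Spark properties prefixed with spark.hadoop., which Spark copies into the Hadoop configuration. A minimal sketch, assuming Hadoop 2.8+ on the classpath; s3a_session_conf and build_session are hypothetical helper names, and the three credential values are expected to come from an STS session.

```python
def s3a_session_conf(access_key, secret_key, session_token):
    """Map temporary credentials onto the S3A settings listed above."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.session.token": session_token,
    }


def build_session(conf):
    """Create a SparkSession with the given properties applied."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark installed

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage sketch:
# spark = build_session(s3a_session_conf(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN))
```

    Session credentials expire, so a long-running job would need to be restarted with fresh values; this sketch does not handle renewal.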
    

    What Hadoop 2.7 does have (enabled by default) is support for picking up the AWS_ environment variables.

    If you set the AWS env vars for session login on your local system and on the remote ones, they should get picked up.

    I know it's a pain, but as far as the Hadoop team is concerned, Hadoop 2.7 shipped mid-2016 and we've done a lot since then, stuff which we aren't going to backport.

  • 2021-02-09 00:06

    You could try the approach in Locally reading S3 files through Spark (or better: pyspark).

    However, I've had better luck setting environment variables (AWS_ACCESS_KEY_ID etc.) in Bash; pyspark will automatically pick these up for your session.
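
    The same variables can also be set from Python before the SparkContext is created. This is a rough sketch: export_credentials and executor_env_conf are hypothetical helpers, creds is assumed to be the "Credentials" dict returned by an STS assume-role call, and whether AWS_SESSION_TOKEN is honoured depends on the Hadoop and AWS SDK versions in use.

```python
import os

# STS credential field -> conventional AWS environment variable name
AWS_ENV_NAMES = {
    "AccessKeyId": "AWS_ACCESS_KEY_ID",
    "SecretAccessKey": "AWS_SECRET_ACCESS_KEY",
    "SessionToken": "AWS_SESSION_TOKEN",
}


def export_credentials(creds, environ=None):
    """Copy session credentials into the driver's environment."""
    environ = os.environ if environ is None else environ
    for field, name in AWS_ENV_NAMES.items():
        environ[name] = creds[field]
    return environ


def executor_env_conf(creds):
    """Build spark.executorEnv.* properties so executors see the same values."""
    return {
        f"spark.executorEnv.{name}": creds[field]
        for field, name in AWS_ENV_NAMES.items()
    }
```

    The driver-side variables only take effect if they are set before the JVM starts; the spark.executorEnv.* properties cover the remote processes.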

  • 2021-02-09 00:17

    IAM roles for accessing S3 are only supported by s3a, because it uses the AWS SDK.

    You need to put the hadoop-aws JAR and the aws-java-sdk JAR (and the third-party JARs in its package) on your CLASSPATH.

    hadoop-aws link.

    aws-java-sdk link.

    Then set this in core-site.xml:

    <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    