How to specify AWS Access Key ID and Secret Access Key as part of an Amazon s3n URL

Asked 2020-12-30 22:43

I am passing input and output folders as parameters to a MapReduce word count program from a webpage.

I am getting the error below:

HTTP Status 500 - Requ

5 Answers
  • 2020-12-30 23:05

    For pyspark beginners:

    Preparation

    Download the hadoop-aws jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws and put it in Spark's jars folder.
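
    Alternatively, you can let Spark pull the connector from Maven at launch time via the spark.jars.packages config (or the equivalent --packages flag of spark-submit). A minimal sketch; the hadoop-aws version below is an assumption and must match your Hadoop build:

    from pyspark.sql import SparkSession

    # Sketch: fetch hadoop-aws at startup instead of copying the jar by hand.
    # "2.7.3" is a placeholder version -- pick the one matching your Hadoop.
    spark = (SparkSession.builder
             .appName("s3-demo")
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
             .getOrCreate())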

    Then you can set the credentials in either of the following ways:

    1. Hadoop config file

    Export the credentials as environment variables (for example in your shell profile or in spark-env.sh):

    export AWS_ACCESS_KEY_ID=<access-key>
    export AWS_SECRET_ACCESS_KEY=<secret-key>

    and declare the S3 filesystem implementations in core-site.xml:

    <configuration>
      <property>
        <name>fs.s3n.impl</name>
        <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
      </property>
    
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
    
      <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
      </property>
    </configuration>
    

    2. pyspark config

    # Credentials for the s3, s3n and s3a connectors
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
    # Filesystem implementations for each URL scheme
    sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
    

    Example

    from pyspark.sql import SparkSession
    
    
    if __name__ == "__main__":
        """
            Usage: S3 sample
        """
        access_key = '<access-key>'
        secret_key = '<secret-key>'
    
        spark = SparkSession\
            .builder\
            .appName("Demo")\
            .getOrCreate()
    
        sc = spark.sparkContext
    
        # Remove this block if you use core-site.xml and environment variables instead
        sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
        sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
        sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
        sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
        sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
        sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
        sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
    
        # fetch from s3, returns RDD
        csv_rdd = spark.sparkContext.textFile("s3n://<bucket-name>/path/to/file.csv")
        c = csv_rdd.count()
        print("~~~~~~~~~~~~~~~~~~~~~count~~~~~~~~~~~~~~~~~~~~~")
        print(c)
    
        spark.stop()
    
  • 2020-12-30 23:09

    Passing the AWS credentials as part of the Amazon s3n URL is not normally recommended from a security standpoint, especially if that code is pushed to a repository hosting service (like GitHub). Ideally, set your credentials in conf/core-site.xml as:

    <configuration>
      <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>XXXXXX</value>
      </property>
    
      <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>XXXXXX</value>
      </property>
    </configuration>
    

    or reinstall awscli on your machine.

    pip install awscli
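
    With the credentials in core-site.xml, the URL itself stays free of secrets. A minimal pyspark sketch of reading from S3 under that setup (bucket name and path are placeholders, not from the original answer):

    from pyspark.sql import SparkSession

    # Credentials are resolved from conf/core-site.xml, so none appear in the URL.
    spark = SparkSession.builder.appName("s3n-read").getOrCreate()
    rdd = spark.sparkContext.textFile("s3n://<bucket-name>/path/to/file.csv")
    print(rdd.count())
    spark.stop()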
    
  • 2020-12-30 23:11

    Create a core-site.xml file and put it on the classpath. In the file, specify:

    <configuration>
        <property>
            <name>fs.s3.awsAccessKeyId</name>
            <value>your aws access key id</value>
            <description>
                aws s3 key id
            </description>
        </property>
    
        <property>
            <name>fs.s3.awsSecretAccessKey</name>
            <value>your aws secret access key</value>
            <description>
                aws s3 secret key
            </description>
        </property>
    </configuration>
    

    Hadoop by default loads two configuration resources, in order, from the classpath:

    • core-default.xml: read-only defaults for Hadoop
    • core-site.xml: site-specific configuration for a given Hadoop installation

    You can check that your file was actually picked up with the sketch after this list.
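
    A quick verification sketch, reusing the hadoopConfiguration() handle shown in the pyspark answer above (the property name is the one configured in this answer):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("check-core-site").getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    # Prints your key id if core-site.xml was found on the classpath, otherwise None
    print(conf.get("fs.s3.awsAccessKeyId"))
    spark.stop()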
  • 2020-12-30 23:16

    I suggest you use this:

    hadoop distcp \
      -Dfs.s3n.awsAccessKeyId=<your_access_id> \
      -Dfs.s3n.awsSecretAccessKey=<your_access_key> \
      s3n://origin hdfs://destination
    

    It also works as a workaround when the secret key contains slashes. The parameters with the ID and the access key must be supplied in exactly this order: after distcp and before the origin.

  • 2020-12-30 23:18

    The documentation (http://wiki.apache.org/hadoop/AmazonS3) gives the format:

     s3n://ID:SECRET@BUCKET/Path
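
    For completeness, a sketch of using that format from pyspark (ID, SECRET and BUCKET are placeholders; as the distcp answer above notes, secrets containing slashes tend to break this style, and the core-site.xml answers are the safer option):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3n-inline-creds").getOrCreate()
    # Credentials embedded directly in the URL -- placeholders only
    rdd = spark.sparkContext.textFile("s3n://ID:SECRET@BUCKET/path/to/file.csv")
    print(rdd.count())
    spark.stop()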
    