Question
I'm running a pyspark application through Google's dataproc on a cluster I created. In one stage, the application needs to access a directory in an Amazon S3 bucket. At that stage, I get the error:
AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I logged onto the head node of the cluster and set /etc/boto.cfg with my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but that didn't solve the access issue.
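For reference, the failing stage is just an ordinary Hadoop-backed read from PySpark. A minimal sketch of that kind of access, with a hypothetical bucket and path:
from pyspark import SparkContext

sc = SparkContext()
# Hypothetical bucket/path; without fs.s3* credentials configured, this
# read fails with the "AWS Access Key ID and Secret Access Key must be
# specified ..." error quoted above.
rdd = sc.textFile("s3://<bucket>/<path>/")
print(rdd.count())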
(1) Any other suggestions for how to access AWS S3 from a dataproc cluster?
(2) Also, what is the name of the user that dataproc uses to access the cluster? If I knew that, I could set the ~/.aws directory on the cluster for that user.
Thanks.
Answer 1:
Since you're using the Hadoop/Spark interfaces (like sc.textFile), everything should indeed be done through the fs.s3.*, fs.s3n.*, or fs.s3a.* keys rather than trying to wire through any ~/.aws or /etc/boto.cfg settings. There are a few ways you can plumb those settings through to your Dataproc cluster:
At cluster creation time:
gcloud dataproc clusters create <cluster-name> --properties \
    core:fs.s3.awsAccessKeyId=<s3AccessKey>,core:fs.s3.awsSecretAccessKey=<s3SecretKey> \
    --num-workers ...
The core prefix here indicates you want the settings to be placed in the core-site.xml file, as explained in the Cluster Properties documentation.
Alternatively, at job-submission time, if you're using Dataproc's APIs:
gcloud dataproc jobs submit pyspark --cluster <your-cluster> \
--properties spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey>,spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey> \
...
In this case, we're passing the properties through as Spark properties, and Spark provides a handy mechanism to define "hadoop" conf properties as a subset of the Spark conf, simply using the spark.hadoop.* prefix. If you're submitting at the command line over SSH, this is equivalent to:
spark-submit --conf spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey> \
--conf spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey>
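The same spark.hadoop.* prefix can also be set programmatically in the driver, if you'd rather not pass the keys on the command line. A minimal sketch, using the same property names as above with placeholder values:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.hadoop.fs.s3.awsAccessKeyId", "<s3AccessKey>")
        .set("spark.hadoop.fs.s3.awsSecretAccessKey", "<s3SecretKey>"))
sc = SparkContext(conf=conf)
# Spark copies spark.hadoop.* entries into the Hadoop Configuration, so
# Hadoop-backed reads such as sc.textFile("s3://...") see the credentials.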
Finally, if you want to set it up at cluster creation time but prefer not to have your access keys explicitly set in your Dataproc metadata, you might opt to use an initialization action instead. There's a handy tool called bdconfig that should be present on the path, with which you can modify XML settings easily:
#!/bin/bash
# Create this shell script, name it something like init-aws.sh
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsAccessKeyId' \
--value '<s3AccessKey>' \
--clobber
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsSecretAccessKey' \
--value '<s3SecretKey>' \
--clobber
Upload that to a GCS bucket somewhere, and use it at cluster creation time:
gsutil cp init-aws.sh gs://<your-bucket>/init-aws.sh
gcloud dataproc clusters create <cluster-name> --initialization-actions \
    gs://<your-bucket>/init-aws.sh
While Dataproc metadata is indeed encrypted at rest and heavily secured just like any other user data, using the init action instead helps prevent inadvertently exposing your access key/secret to, say, someone standing behind your screen while you're viewing your Dataproc cluster properties.
Answer 2:
You can try setting the AWS config while initializing the SparkContext:
from pyspark import SparkConf, SparkContext

conf = SparkConf()  # your SparkConf()
sc = SparkContext(conf=conf)
# In PySpark, the Hadoop configuration is reached via the underlying JavaSparkContext
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<s3AccessKey>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<s3SecretKey>")
Source: https://stackoverflow.com/questions/39377635/reading-s3-data-from-googles-dataproc