Question
I'm having trouble reading CSV files stored in my AWS S3 bucket from EMR.
I have read quite a few posts about it and have done the following to try to make it work:
- Added an IAM policy allowing read and write access to S3
- Tried passing the URIs in the Argument section of the spark-submit request
I thought querying S3 from EMR within the same account was straightforward (it works locally once I define a file system and provide AWS credentials), but when I run:
df = spark.read.option("delimiter", ",").csv("s3://{0}/{1}/*.csv".format(bucket_name, power_prod_key), header=True)
Nothing happens: no exception is raised and the cluster keeps running, but nothing after this line is executed. (Specifying a single file instead of "*.csv" behaves the same way.)
I created the cluster using the AWS console, but here is the exported CLI command:
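To make the call above easier to reproduce in isolation, here is a minimal sketch of the same read wrapped in a function. The helper names `build_csv_glob` and `read_power_prod` are hypothetical, and stripping slashes from the key prefix is an added assumption, not something the original script is known to do.

```python
def build_csv_glob(bucket_name, key_prefix):
    """Build the s3:// glob used by the read, e.g. s3://bucket/prefix/*.csv.

    Assumption: leading/trailing slashes on the prefix are stripped so the
    resulting URI never contains an empty path segment.
    """
    return "s3://{0}/{1}/*.csv".format(bucket_name, key_prefix.strip("/"))


def read_power_prod(spark, bucket_name, key_prefix):
    """Read every CSV under the prefix, comma-delimited, first row as header.

    `spark` is an existing SparkSession supplied by the caller.
    """
    path = build_csv_glob(bucket_name, key_prefix)
    # Printed to the driver's stdout, to confirm which path is being read.
    print("Reading from", path)
    return spark.read.option("delimiter", ",").csv(path, header=True)
```

With a live session this would be called as `df = read_power_prod(spark, bucket_name, power_prod_key)`. Printing the resolved path before the read at least confirms whether the driver reaches this line.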
aws emr create-cluster \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark \
  --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-3482b47e","EmrManagedSlaveSecurityGroup":"sg-05c284d83c1307807","EmrManagedMasterSecurityGroup":"sg-01cd4e90f09dff3ad"}' \
  --release-label emr-5.21.0 \
  --log-uri 's3n://aws-logs-597071303168-us-east-1/elasticmapreduce/' \
  --steps '[{"Args":["spark-submit","--deploy-mode","cluster","--py-files","s3://powercaster-bct/code/func.zip","s3://powercaster-bct/code/PowerProdPrediction.py","s3://powercaster-bct/power-production/*.csv","s3://powercaster-bct/results/rnd-frst-predictions.csv","s3://powercaster-bct/results/rnd-frst-target.csv"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]' \
  --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m4.large","Name":"Master - 1"}]' \
  --configurations '[{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}]}]' \
  --auto-terminate \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --ebs-root-volume-size 10 \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --name 'My cluster' \
  --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
  --region us-east-1
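For context, the three S3 URIs that follow the script path in the step are passed to PowerProdPrediction.py as positional arguments. A minimal sketch of how a script might read them, assuming `argparse` and the hypothetical argument names `input_glob`, `predictions_out` and `target_out` (none of these come from the actual script):

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional S3 URIs that follow the script path.

    The argument names are hypothetical; only their order mirrors the step.
    """
    parser = argparse.ArgumentParser(description="Hypothetical CLI for the Spark step")
    parser.add_argument("input_glob", help="S3 URI (may contain a glob) of the input CSV files")
    parser.add_argument("predictions_out", help="S3 URI for the predictions output")
    parser.add_argument("target_out", help="S3 URI for the target output")
    return parser.parse_args(argv)


# Example using the same values as the step definition above.
args = parse_args([
    "s3://powercaster-bct/power-production/*.csv",
    "s3://powercaster-bct/results/rnd-frst-predictions.csv",
    "s3://powercaster-bct/results/rnd-frst-target.csv",
])
print(args.input_glob)
```

Calling `parse_args()` with no argument reads `sys.argv[1:]`, which is what the script would receive when launched by spark-submit.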
Should I provide some specific Hadoop configuration to define a file system, or supply my credentials somehow?
Any idea why I can't link S3 to EMR?
Source: https://stackoverflow.com/questions/55239088/reading-from-s3-in-emr