Question
I have a pipeline using Lambda and EMR, where I read CSV from an S3 bucket in account A and write Parquet to an S3 bucket in account B. The EMR cluster is created in account B and has access to the S3 bucket in account B. I cannot add access to account A's S3 bucket to EMR_EC2_DefaultRole (that account is the enterprise-wide data storage), so I use an access key, secret key, and session token to access the account A bucket. These credentials are obtained through a Cognito token.
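For context, the Lambda exchanges the Cognito token for the temporary account A credentials roughly like this (a minimal sketch; the region, identity pool ID, and login provider below are placeholders, not my real values):

import boto3

# id_token is the Cognito token passed to the Lambda; pool ID and provider are placeholders
cognito = boto3.client("cognito-identity", region_name="us-east-1")
identity = cognito.get_id(
    IdentityPoolId="us-east-1:00000000-0000-0000-0000-000000000000",
    Logins={"cognito-idp.us-east-1.amazonaws.com/us-east-1_example": id_token},
)
creds = cognito.get_credentials_for_identity(
    IdentityId=identity["IdentityId"],
    Logins={"cognito-idp.us-east-1.amazonaws.com/us-east-1_example": id_token},
)["Credentials"]

# these are the values passed to the Spark job as dl_access_key / dl_secret_key / dl_session_key
dl_access_key = creds["AccessKeyId"]
dl_secret_key = creds["SecretKey"]
dl_session_key = creds["SessionToken"]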
METHOD 1:
I am using the fs.s3 protocol to read CSV from S3 in account A and writing to S3 in account B. The PySpark code, which runs on EMR, reads from S3 (A) and writes Parquet to S3 (B). I submit around 100 jobs at a time.
Reading uses the following settings:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")
Writing:
I am using the s3a protocol: s3a://some_bucket/
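The write itself looks roughly like this (the bucket/prefix is a placeholder, and I am assuming the same repartition and partition columns I use in METHOD 2 below):

spark_df_csv.repartition(1).write \
    .partitionBy('org_id', 'institution_id') \
    .mode('append') \
    .parquet("s3a://some_bucket/some_prefix/")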
It works, but with these issues:
- This method creates a _temporary folder and sometimes doesn't delete it; I can see the _temporary folder left in the S3 bucket, and not all CSVs get converted to Parquet.
- When I enable EMR step concurrency (256, on EMR 5.28), which allows steps to run in parallel, and submit 100 jobs, I get a _temporary rename error for some of the files.
METHOD 2:
I feel s3a is not good for parallel jobs, so I want to both read and write using fs.s3, since it has a better file committer.
So initially I set the Hadoop configuration as above for account A, and then unset it, so that the defaults (and thus the account B bucket) apply again for the write. Like this:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")
spark_df_csv.repartition(1).write \
    .partitionBy('org_id', 'institution_id') \
    .mode('append') \
    .parquet(write_path)
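Putting METHOD 2 together, each job effectively runs this sequence (read path is a placeholder):

hadoop_config = sc._jsc.hadoopConfiguration()

# point fs.s3 at account A using the temporary credentials from Cognito
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)

spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")

# remove the keys so the write falls back to the cluster's default role in account B
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")

spark_df_csv.repartition(1).write \
    .partitionBy('org_id', 'institution_id') \
    .mode('append') \
    .parquet(write_path)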
Issues:
This works, but the issue is: say I trigger the Lambda, which in turn submits jobs for 100 files (in a loop); some 10-odd files fail with Access Denied while writing to the S3 bucket.
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service:
This could be because the unset sometimes doesn't take effect, or because the set/unset of the Hadoop configuration happens concurrently across parallel runs, i.e. the Spark context of one job is unsetting the configuration while another is setting it, which may cause this issue. Though I'm not sure how Spark contexts behave in parallel, doesn't each job have its own Spark context and session? Please suggest alternatives for my situation.
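For reference, a minimal sketch of one pattern I am aware of but have not verified for my setup: Hadoop's per-bucket s3a configuration (which I believe is supported in the Hadoop versions shipped with EMR 5.x), which would scope the account A credentials to a single bucket instead of setting and unsetting global keys. Bucket names below are placeholders, and it requires reading from account A over s3a:// rather than fs.s3:

hadoop_config = sc._jsc.hadoopConfiguration()

# per-bucket s3a settings only apply to s3a://account-a-bucket/..., so nothing needs to be unset
hadoop_config.set("fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
                  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)

# read from account A over s3a using the scoped credentials
spark_df_csv = spark_session.read.option("header", "true").csv("s3a://account-a-bucket/somepath")

# the write to account B keeps using the cluster's default role
spark_df_csv.write.mode('append').parquet("s3a://account-b-bucket/some_prefix/")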
Source: https://stackoverflow.com/questions/59670352/fs-s3-configuration-with-two-s3-account-with-emr