fs.s3 configuration with two S3 accounts on EMR

Submitted by 若如初见 on 2020-01-25 10:10:23

Question


I have a pipeline using Lambda and EMR, where I read a CSV from an S3 bucket in account A and write Parquet to an S3 bucket in account B. The EMR cluster is created in account B and has access to account B's S3. I cannot add account A's S3 bucket to EMR_EC2_DefaultRole (that account is the enterprise-wide data store), so I use an access key, secret key and session token to access account A's bucket. These credentials come from a Cognito token.
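
For reference, the Lambda submits one EMR step per input file, roughly like this (an illustrative boto3 sketch only; the cluster id, script location and argument handling are placeholders, not the exact code):

import boto3

emr = boto3.client("emr")

# one EMR step per input file; cluster id, script location and paths are placeholders
def submit_step(cluster_id, input_csv_path, output_parquet_path):
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": "csv-to-parquet",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://account-b-bucket/scripts/csv_to_parquet.py",  # placeholder script
                    input_csv_path, output_parquet_path,
                ],
            },
        }],
    )
    return response["StepIds"][0]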

METHOD 1

In this method I use the fs.s3 protocol to read the CSV from account A's S3 and write to account B's S3. The PySpark code that reads from S3 (A) and writes Parquet to S3 (B) runs on EMR, and I submit around 100 jobs at a time.

Reading, with the following settings:

hadoop_config = sc._jsc.hadoopConfiguration()

# temporary credentials for account A's bucket (from the Cognito token)
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)

spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")

Writing:

For writing I use the s3a protocol: s3a://some_bucket/
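
The write call itself is the same partitioned Parquet write shown further below under METHOD 2, just with the s3a destination (the prefix here is a placeholder):

spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']) \
    .mode('append').parquet("s3a://some_bucket/some_prefix/")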

It works, but sometimes I see:

  1. a _temporary folder left behind in the S3 bucket, and not all CSVs converted to Parquet
  2. when I set EMR step concurrency to 256 (EMR 5.28) and submit 100 jobs, a _temporary rename error

Issues:

  1. This method creates a _temporary folder and sometimes does not delete it; I can see the _temporary folder left in the S3 bucket.
  2. When I enable EMR step concurrency (EMR 5.28, the latest version), which lets steps run in parallel, I get a rename _temporary error for some of the files (the committer settings involved are sketched below).
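
For context, the _temporary folder and the rename step come from the Hadoop FileOutputCommitter. The settings I understand control this behaviour are sketched below (property names as I read them in the Hadoop/EMR docs, so worth verifying for EMR 5.28):

hadoop_config = sc._jsc.hadoopConfiguration()

# v2 commits task output directly to the destination, skipping the final job-level rename
hadoop_config.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# EMRFS S3-optimized committer (EMR 5.19+); only applies to Parquet written through
# fs.s3 (EMRFS), not s3a, and avoids the _temporary rename step entirely
spark_session.conf.set("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")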

METHOD 2

I feel s3a is not a good fit for parallel jobs, so I want to both read and write using fs.s3, since it has better file committers.

So I did the following: I first set the Hadoop configuration to account A's credentials as above, then unset it, so that the job eventually falls back to the default account B credentials for the output bucket. Like this:

hadoop_config = sc._jsc.hadoopConfiguration()

# drop account A's credentials so fs.s3 falls back to the cluster's default credentials (account B)
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")


spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']) \
    .mode('append').parquet(write_path)
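
Putting the two snippets together, the per-job flow in this method is roughly as follows (bucket paths are placeholders):

hadoop_config = sc._jsc.hadoopConfiguration()

# 1. point fs.s3 at account A's credentials (from the Cognito token) for the read
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)

spark_df_csv = spark_session.read.option("Header", "True").csv("s3://account-a-bucket/somepath")

# 2. unset them so fs.s3 falls back to the cluster's default credentials (account B)
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")

# 3. write the Parquet output to account B's bucket
spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']) \
    .mode('append').parquet("s3://account-b-bucket/output/")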

Issues:

This works, but the issue is: if I trigger the Lambda, which in turn submits jobs for 100 files (in a loop), some 10-odd files fail with Access Denied while writing to the S3 bucket.

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service:

This could be because the unset sometimes does not take effect, or because the set/unset of the Hadoop configuration happens in parallel across jobs. I mean, the Spark context for one job might be unsetting the Hadoop configuration while another is setting it, which could cause this issue, though I am not sure how Spark contexts behave when run in parallel.

Doesn't each job have its own Spark context and session? Please suggest alternatives for my situation.
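
For what it is worth, one way to avoid the global set/unset entirely would be the s3a per-bucket configuration, which scopes credentials to a single bucket name. This is only a sketch of that idea (bucket names are placeholders, and I have not tried it on EMR):

hadoop_config = sc._jsc.hadoopConfiguration()

# these credentials apply only to account-a-bucket; every other bucket keeps the
# cluster's default credential chain (account B)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
                  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)

spark_df_csv = spark_session.read.option("Header", "True").csv("s3a://account-a-bucket/somepath")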

Source: https://stackoverflow.com/questions/59670352/fs-s3-configuration-with-two-s3-account-with-emr
