Junk Spark output file on S3 with dollar signs

南笙 2021-01-25 23:38

I have a simple Spark job that reads a file from S3, takes the first five records, and writes them back to S3. What I see is that there is always an additional file in S3 next to my output "directory", named after it but with a _$folder$ suffix (e.g. output_$folder$). What is this file, and can I get rid of it?
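
A minimal sketch of the kind of job described, written in Scala against the Spark SQL API; the bucket names and paths are hypothetical, since the question does not show the actual code:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of the job described above; bucket and key names are hypothetical.
    object TakeFiveJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("take-five").getOrCreate()

        // Read a text file from S3, keep only the first five lines, write them back.
        spark.read
          .textFile("s3://my-bucket/input/data.txt") // hypothetical input path
          .limit(5)
          .write
          .text("s3://my-bucket/output")             // the "directory" next to which the marker object appears

        spark.stop()
      }
    }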

2 Answers
  • 2021-01-25 23:57

    OK, it seems I found out what it is: some kind of marker file, probably used to determine whether the S3 "directory" object exists or not. How did I reach this conclusion? First, I found this link, which shows the source of the

    org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
    

    method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html

    Then I searched other source repositories to see whether I would find a different version of the method. I didn't.

    In the end, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file in place. The job failed, saying that the output directory already exists.

    My conclusion: this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with it.

    All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.

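If the marker objects themselves are the nuisance, one option is to delete any key ending in _$folder$ under the output prefix after the job finishes. A hedged sketch using the AWS SDK for Java v1 from Scala 2.13; the bucket and prefix are hypothetical and credentials come from the default provider chain:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.jdk.CollectionConverters._

    // Sketch: delete leftover "<name>_$folder$" marker objects under an output prefix.
    // Only the first page of results (up to 1000 keys) is handled here.
    object DeleteFolderMarkers {
      def main(args: Array[String]): Unit = {
        val bucket = "my-bucket" // hypothetical bucket
        val prefix = "output"    // hypothetical output prefix
        val s3 = AmazonS3ClientBuilder.defaultClient()

        s3.listObjectsV2(bucket, prefix)
          .getObjectSummaries.asScala
          .map(_.getKey)
          .filter(_.endsWith("_$folder$"))
          .foreach { key =>
            println(s"Deleting s3://$bucket/$key")
            s3.deleteObject(bucket, key)
          }
      }
    }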
  • 2021-01-26 00:02

    Changing the S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer created since I started using s3a://.

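For completeness, a sketch of what the switch might look like in the job itself, assuming the hadoop-aws module (which provides org.apache.hadoop.fs.s3a.S3AFileSystem) is on the classpath; bucket names and paths are again hypothetical:

    import org.apache.spark.sql.SparkSession

    // Same job as before, but addressing S3 through the s3a:// connector instead of s3://.
    object TakeFiveJobS3A {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("take-five-s3a")
          // Usually unnecessary (s3a is already bound to S3AFileSystem by default),
          // but shown here to make the connector choice explicit.
          .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
          .getOrCreate()

        spark.read
          .textFile("s3a://my-bucket/input/data.txt") // s3a:// instead of s3://
          .limit(5)
          .write
          .text("s3a://my-bucket/output")             // no _$folder$ marker is created here

        spark.stop()
      }
    }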