I have a simple Spark job that reads a file from S3, takes the first five records, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", named like the output path with a _$folder$ suffix.
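For context, a minimal sketch of the kind of job I mean (bucket and path names here are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

object TakeFiveJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("take-five").getOrCreate()
    val sc = spark.sparkContext

    // Read the input, keep the first five records, write them back to S3.
    val lines = sc.textFile("s3://my-bucket/input/data.txt")
    sc.parallelize(lines.take(5)).saveAsTextFile("s3://my-bucket/output")

    // Afterwards the bucket contains the expected output/part-* files,
    // plus an extra zero-byte object called output_$folder$.
    spark.stop()
  }
}
```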
OK, it seems I found out what it is. It is some kind of marker file, probably used for determining whether the S3 directory object exists or not. How did I reach this conclusion? First, I found this link that shows the source of the
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
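Roughly speaking (this is a paraphrase of the idea, not the actual Hadoop source), creating a "directory" on that filesystem amounts to putting a zero-byte object whose key is the directory name plus the _$folder$ suffix, since S3 itself has no real directories:

```scala
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata

// Paraphrased idea only: create an empty object as a "directory marker".
def mkdirMarker(bucket: String, dir: String): Unit = {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val meta = new ObjectMetadata()
  meta.setContentLength(0L)
  // For dir = "output" this creates the key "output_$folder$".
  s3.putObject(bucket, dir + "_$folder$",
    new ByteArrayInputStream(Array.empty[Byte]), meta)
}
```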
Then I searched other source repositories to see whether I would find a different version of the method. I didn't.
Finally, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file in place. The job failed, saying that the output directory already exists.
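As I understand it, the failure comes from the usual pre-write existence check: as long as the output_$folder$ marker is there, the Hadoop filesystem reports the output path as an existing directory. A quick way to see this (output path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = new Path("s3://my-bucket/output")  // placeholder path
val fs = FileSystem.get(outputPath.toUri, new Configuration())

// With the part files deleted but output_$folder$ still present,
// this still prints true, which is why the rerun fails with
// "output directory already exists".
println(fs.exists(outputPath))
```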
My conclusion: this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with it.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
Changing the S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer being created since I started using s3a://.
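A rough sketch of what the switch looks like (assuming the hadoop-aws and matching AWS SDK jars are on the classpath; the explicit credential settings are optional and just one way to configure s3a):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("take-five").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration

// Optional: explicit credentials; otherwise the default provider chain is used.
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Same job as before, only the URI scheme changes.
val lines = spark.sparkContext.textFile("s3a://my-bucket/input/data.txt")
spark.sparkContext
  .parallelize(lines.take(5))
  .saveAsTextFile("s3a://my-bucket/output")
// No output_$folder$ marker is left behind with the s3a connector.
```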