I have approximately 1 million text files stored in S3. I want to rename all the files based on their folder names.
How can I do that in Spark with Scala?
I am looking for a way to do this.
You can use the normal HDFS APIs, something like this (typed in, not tested):
import org.apache.hadoop.fs.Path

val src  = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs   = src.getFileSystem(conf)
fs.rename(src, dest)
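Since the question is really about renaming every file after its parent folder, here is a minimal driver-side sketch of that idea. It assumes the desired new name is <folder>_<originalName>, and the bucket/prefix are placeholders; adjust both to your layout. It lists everything first and only then renames, so the renames don't disturb the listing that is still in progress.

import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

val root = new Path("s3a://bucket/data/src")        // placeholder root prefix
val fs   = root.getFileSystem(sc.hadoopConfiguration)

// 1. Collect the (source, destination) pairs first.
val renames = ArrayBuffer.empty[(Path, Path)]
val it = fs.listFiles(root, true)                   // recursive listing
while (it.hasNext) {
  val src    = it.next().getPath
  val folder = src.getParent.getName                // immediate parent folder name
  renames += (src -> new Path(src.getParent, s"${folder}_${src.getName}"))
}

// 2. Rename sequentially; on S3A each call is a COPY + DELETE.
renames.foreach { case (src, dest) =>
  if (!fs.exists(dest)) fs.rename(src, dest)        // skip if the target already exists
}

If you later decide to spread the renames across executors, keep the parallelism low, for exactly the throttling reasons described below.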
The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it can actually slow you down. Don't be surprised if it takes "a while".
You also get billed per COPY call, at $0.005 per 1,000 calls, so it will cost you roughly $5 to try. Test on a small directory until you are sure everything is working.
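To spell out that estimate for roughly 1 million files: each rename issues one COPY, so 1,000,000 / 1,000 × $0.005 ≈ $5 (DELETE calls are free; the LIST calls add a little on top).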