I have approximately 1 million text files stored in S3. I want to rename all the files based on their folder names.
How can I do that in Spark with Scala?
I am looking for a way to do this.
You can use the normal HDFS APIs, something like this (typed in, not tested):
import org.apache.hadoop.fs.Path

val src  = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs   = src.getFileSystem(conf)
fs.rename(src, dest)
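Since the question is really about renaming every file after its parent folder, here is a minimal driver-side sketch of that idea. It assumes the desired new name is <folder>_<originalName>, and the bucket/prefix are placeholders; adjust both to your layout. It lists everything first and only then renames, so the renames don't disturb the listing that is still in progress.

import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

val root = new Path("s3a://bucket/data/src")        // placeholder root prefix
val fs   = root.getFileSystem(sc.hadoopConfiguration)

// 1. Collect the (source, destination) pairs first.
val renames = ArrayBuffer.empty[(Path, Path)]
val it = fs.listFiles(root, true)                   // recursive listing
while (it.hasNext) {
  val src    = it.next().getPath
  val folder = src.getParent.getName                // immediate parent folder name
  renames += (src -> new Path(src.getParent, s"${folder}_${src.getName}"))
}

// 2. Rename sequentially; on S3A each call is a COPY + DELETE.
renames.foreach { case (src, dest) =>
  if (!fs.exists(dest)) fs.rename(src, dest)        // skip if the target already exists
}

If you later decide to spread the renames across executors, keep the parallelism low, for exactly the throttling reasons described below.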
The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it can actually slow you down. Don't be surprised if it takes "a while".
You also get billed per COPY call, at $0.005 per 1,000 calls, so it will cost you roughly $5 to try. Test on a small directory until you are sure everything is working.
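To spell out that estimate for roughly 1 million files: each rename issues one COPY, so 1,000,000 / 1,000 × $0.005 ≈ $5 (DELETE calls are free; the LIST calls add a little on top).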