How to rename S3 files (not HDFS) in Spark Scala

野趣味 · 2021-01-16 03:54

I have approximately 1 million text files stored in S3. I want to rename all of them based on their folder names.

How can I do that in Spark Scala?

I am looking fo

1 Answer
  • 2021-01-16 04:10

    You can use the normal Hadoop FileSystem APIs, something like (typed in, not tested):

    import org.apache.hadoop.fs.Path

    val src  = new Path("s3a://bucket/data/src")
    val dest = new Path("s3a://bucket/data/dest")
    val conf = sc.hadoopConfiguration   // sc is the SparkContext
    val fs   = src.getFileSystem(conf)  // resolves to the S3A FileSystem for this URI
    fs.rename(src, dest)                // returns false rather than throwing on failure

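    Since you want the new name to come from the enclosing folder, here is a minimal sketch of that loop. The layout (s3a://bucket/data/<folder>/<file>) and the target name (<folder>_<file>) are my assumptions; adjust both to your actual naming rule.

    import org.apache.hadoop.fs.Path

    val root = new Path("s3a://bucket/data")        // hypothetical root prefix
    val fs   = root.getFileSystem(sc.hadoopConfiguration)

    // Collect the listing first so renames don't race the paged S3 listing.
    val files = scala.collection.mutable.ArrayBuffer[Path]()
    val it    = fs.listFiles(root, true)            // recursive listing
    while (it.hasNext) files += it.next().getPath

    for (src <- files) {
      val folder = src.getParent.getName            // immediate parent folder name
      val dest   = new Path(src.getParent, s"${folder}_${src.getName}")
      if (src != dest) fs.rename(src, dest)         // COPY + DELETE under the hood
    }
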

    The way the S3A client fakes a rename is a COPY + DELETE of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it will potentially slow you down. Don't be surprised if it takes "a while".
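
    If the single-threaded loop is still too slow, the renames can be spread over a few Spark tasks. This is only a sketch under assumptions of mine: the 16-slice parallelism is arbitrary, executors are assumed to pick up S3 credentials from the cluster's core-site.xml, and because of the throttling described above, more slices is not necessarily faster.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    val root = new Path("s3a://bucket/data")        // hypothetical root prefix
    val it   = root.getFileSystem(sc.hadoopConfiguration).listFiles(root, true)

    // Paths are collected as plain strings so the closure below is serializable.
    val toRename = scala.collection.mutable.ArrayBuffer[String]()
    while (it.hasNext) toRename += it.next().getPath.toString

    sc.parallelize(toRename.toSeq, numSlices = 16).foreachPartition { part =>
      val conf = new Configuration()                // built on the executor side
      part.foreach { p =>
        val src = new Path(p)
        val fs  = src.getFileSystem(conf)
        val dst = new Path(src.getParent, s"${src.getParent.getName}_${src.getName}")
        if (src != dst) fs.rename(src, dst)
      }
    }
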

    You also get billed per COPY call, at $0.005 per 1,000 calls, so a million files works out to 1,000,000 × $0.005 / 1,000 ≈ $5. Test on a small directory until you are sure everything is working.
