Question
I'm trying to save a DataFrame to CSV using the new Spark 2.1 csv option:
df.select(myColumns: _*).write
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv(absolutePath)
Everything works fine, and I don't mind having the part-000XX prefix, but it now seems that a UUID is added as a suffix. This is the name I get, followed by the name I want:
part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz ==> part-00032.csv.gz
Does anyone know how I can remove this UUID suffix and keep only the part-000XX convention?
Thanks
Answer 1:
You can remove the UUID by overriding the configuration option "spark.sql.sources.writeJobUUID":
https://github.com/apache/spark/commit/0818fdec3733ec5c0a9caa48a9c0f2cd25f84d13#diff-c69b9e667e93b7e4693812cc72abb65fR75
Unfortunately, this solution will not fully mirror the old saveAsTextFile style (i.e. part-00000), but it can make the output file name more sane, such as part-00000-output.csv.gz, where "output" is the value you pass to spark.sql.sources.writeJobUUID. The "-" separator is added automatically.
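If you need exactly part-000XX.csv.gz, another option is to rename the files after the write has finished. This is not something Spark does for you; the following is only a minimal sketch, assuming the output directory is on the local filesystem and the job has already completed. The names PartFileRenamer, stripUuid and renameAll are made up for this illustration, and on HDFS or S3 you would have to do the equivalent rename through the Hadoop FileSystem API instead of java.nio.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

// Hypothetical helper, not part of Spark: strips the UUID from part-file names.
object PartFileRenamer {
  // Matches e.g. "part-00032-<uuid>.csv.gz"; captures "part-00032" and the
  // optional extension chain (".csv.gz").
  private val PartWithUuid: Regex =
    """^(part-\d+)-[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(\..+)?$""".r

  /** Returns the name without the UUID, or the name unchanged if it does not match. */
  def stripUuid(fileName: String): String = fileName match {
    case PartWithUuid(prefix, ext) => prefix + Option(ext).getOrElse("")
    case _                         => fileName
  }

  /** Renames every matching part file directly under `dir` (local filesystem only). */
  def renameAll(dir: Path): Unit = {
    // Collect the entries first so the directory is not modified while it is being listed.
    val entries = ArrayBuffer.empty[Path]
    val stream = Files.newDirectoryStream(dir)
    try {
      val it = stream.iterator()
      while (it.hasNext) entries += it.next()
    } finally stream.close()

    entries.foreach { src =>
      val oldName = src.getFileName.toString
      val newName = stripUuid(oldName)
      if (newName != oldName)
        Files.move(src, src.resolveSibling(newName), StandardCopyOption.REPLACE_EXISTING)
    }
  }
}
```

You would call PartFileRenamer.renameAll(java.nio.file.Paths.get(absolutePath)) once the .csv(absolutePath) call has returned. Files that do not match the pattern, such as _SUCCESS or hidden checksum files, are left alone.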
SPARK-8406 is the relevant Spark issue and here's the actual Pull Request: https://github.com/apache/spark/pull/6864
Source: https://stackoverflow.com/questions/42870726/spark-csv-2-1-file-names