How to merge all part files in a folder created by SPARK data frame and rename as folder name in scala

♀尐吖头ヾ 提交于 2019-12-24 18:31:53

问题


Hi i have output of my spark data frame which creates folder structure and creates so may part files . Now i have to merge all part files inside the folder and rename that one file as folder path name .

This is how i do partition

df.write.partitionBy("DataPartition","PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")/
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")

It creates folder structure like this

hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00002-87a61115-92c9-4926-a803-b46315e55a08.c001.csv.gz

I have to create final file like this

hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.currenttime.csv.gz

No part files here bith 001 and 002 is merged two one .

My data size it very big 300 GB gzip and 35 GB zipped so coalesce(1) and repartition becomes very slow .

I have seen one solution here Write single CSV file using spark-csv but i am not able to implement it please help me with it .

Repartition throw error

error: value repartition is not a member of org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]
       dfMainOutputFinalWithoutNull.write.repartition("DataPartition","StatementTypeCode")

回答1:


Try this from the head node outside of Spark...

hdfs dfs -getmerge <src> <localdst>

https://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#getmerge

"Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file."



来源:https://stackoverflow.com/questions/46812388/how-to-merge-all-part-files-in-a-folder-created-by-spark-data-frame-and-rename-a

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!