In Spark, how to write a header to a file if there are no rows in a DataFrame?

Submitted by 懵懂的女人 on 2020-05-16 04:36:16

Question


I want to write a header to a file if there are no rows in the DataFrame. Currently, when I write an empty DataFrame to a file, the file is created but it has no header in it.

I am writing the DataFrame using these settings and this command:
Dataframe.repartition(1) \
         .write \
         .format("com.databricks.spark.csv") \
         .option("ignoreLeadingWhiteSpace", False) \
         .option("ignoreTrailingWhiteSpace", False) \
         .option("header", "true") \
         .save('/mnt/Bilal/Dataframe')

I want the header row in the file, even if there are no data rows in the DataFrame.


Answer 1:


If you want to have just a header file, you can use foldLeft to replace each column with white space and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala; the majority of the code should be reusable, you will just have to convert it to PySpark.

import org.apache.spark.sql.functions.lit

val path = "/user/test"
// Replace every column's values with an empty string
val newdf = df.columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, lit(""))
}
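
For reference, a rough PySpark equivalent of the foldLeft step above (my sketch, assuming your DataFrame is named df):

from functools import reduce
from pyspark.sql.functions import lit

# Replace every column's values with an empty string, as foldLeft does above
newdf = reduce(lambda tempdf, col: tempdf.withColumn(col, lit('')), df.columns, df)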

Create a method for writing the header file:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)

  // open a stream on the cluster's default filesystem (e.g. HDFS)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)

  // write the column names as one comma-separated header line
  writer.write(colNames.mkString(","))
  writer.write("\n")
  writer.close()
}

Call it on your DataFrame:

createHeaderFile(path, newdf.columns)
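
If you need the same helper from PySpark, one option is to reach the Hadoop FileSystem API through the py4j gateway. This is only a sketch under that assumption (the file name is a placeholder, and spark is your SparkSession):

def create_header_file(spark, header_file_path, col_names):
    # Reach the JVM-side Hadoop FileSystem through the py4j gateway
    sc = spark.sparkContext
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    # Write the column names as one comma-separated header line
    output = fs.create(Path("%s/%s" % (header_file_path, "yourfileName.csv")))
    try:
        output.write(bytearray(",".join(col_names) + "\n", "utf-8"))
    finally:
        output.close()

# for example: create_header_file(spark, path, newdf.columns)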



Answer 2:


I had the same problem, in PySpark: when the DataFrame was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.

So, I created a custom method which checks whether the output is a single empty CSV file. If it is, the method appends just the header.

import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # 'a+' lets us both check whether the file is empty and append to it
        with open(csv_file, 'a+') as f:
            f.seek(0)
            if f.readline() == '':
                header = ','.join(columns)
                f.write(header + '\n')

Example:

# Create a dummy DataFrame
df = spark.createDataFrame([(1, 2), (1, 4), (3, 2), (1, 4)], ("a", "b"))

# Filter so that the result is an empty DataFrame
filtered_df = df.filter(df['a'] > 10)

# Write the empty df; the output CSV is created without a header
filtered_df.write.csv('output.csv', header=True)

# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
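
Alternatively, you can avoid patching the file afterwards by checking for emptiness before writing. This is only my sketch, not part of the original answer; it assumes a driver-local output path and a hypothetical part-file name:

import os

def write_csv_with_header(df, out_dir):
    # If the DataFrame has no rows, write a directory containing just the
    # header line ourselves; otherwise let Spark write the CSV as usual.
    if df.rdd.isEmpty():
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, 'part-00000.csv'), 'w') as f:
            f.write(','.join(df.columns) + '\n')
    else:
        df.write.csv(out_dir, header=True)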


Source: https://stackoverflow.com/questions/56946600/in-spark-how-to-write-header-in-a-file-if-there-is-no-row-in-a-dataframe
