In Spark, how to write a header to a file if there are no rows in a DataFrame?

Submitted by 懵懂的女人 on 2020-05-16 04:36:16

Question


I want to write a header to a file if there are no rows in the DataFrame. Currently, when I write an empty DataFrame to a file, the file is created but it has no header in it.

I am writing the DataFrame using these settings and this command:
Dataframe.repartition(1) \
         .write \
         .format("com.databricks.spark.csv") \
         .option("ignoreLeadingWhiteSpace", False) \
         .option("ignoreTrailingWhiteSpace", False) \
         .option("header", "true") \
         .save('/mnt/Bilal/Dataframe')

I want the header row in the file, even if there are no data rows in the DataFrame.


Answer 1:


If you want to have just a header file, you can use foldLeft to replace each column with white space and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala; the majority of the code should be reusable, you will just have to convert it to PySpark.

import org.apache.spark.sql.functions.lit

val path = "/user/test"
// Replace every column's values with an empty string
val newdf = df.columns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, lit(""))
}
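
For reference, a rough PySpark equivalent of the foldLeft step above (my sketch, assuming your DataFrame is named df):

from functools import reduce
from pyspark.sql.functions import lit

# Replace every column's values with an empty string, as foldLeft does above
newdf = reduce(lambda tempdf, col: tempdf.withColumn(col, lit('')), df.columns, df)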

Create a method for writing the header file:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)

  // open a stream on the cluster's default filesystem (e.g. HDFS)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)

  // write the column names as one comma-separated header line
  writer.write(colNames.mkString(","))
  writer.write("\n")
  writer.close()
}

Call it on your DataFrame:

createHeaderFile(path, newdf.columns)
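
If you need the same helper from PySpark, one option is to reach the Hadoop FileSystem API through the py4j gateway. This is only a sketch under that assumption (the file name is a placeholder, and spark is your SparkSession):

def create_header_file(spark, header_file_path, col_names):
    # Reach the JVM-side Hadoop FileSystem through the py4j gateway
    sc = spark.sparkContext
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    # Write the column names as one comma-separated header line
    output = fs.create(Path("%s/%s" % (header_file_path, "yourfileName.csv")))
    try:
        output.write(bytearray(",".join(col_names) + "\n", "utf-8"))
    finally:
        output.close()

# for example: create_header_file(spark, path, newdf.columns)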



Answer 2:


I had the same problem, in PySpark: when the DataFrame was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.

So, I created a custom method which checks whether the output is a single empty CSV file. If it is, the method appends just the header.

import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # 'a+' lets us both check whether the file is empty and append to it
        with open(csv_file, 'a+') as f:
            f.seek(0)
            if f.readline() == '':
                header = ','.join(columns)
                f.write(header + '\n')

Example:

# Create a dummy DataFrame
df = spark.createDataFrame([(1, 2), (1, 4), (3, 2), (1, 4)], ("a", "b"))

# Filter so that the result is an empty DataFrame
filtered_df = df.filter(df['a'] > 10)

# Write the empty df; the output CSV is created without a header
filtered_df.write.csv('output.csv', header=True)

# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
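
Alternatively, you can avoid patching the file afterwards by checking for emptiness before writing. This is only my sketch, not part of the original answer; it assumes a driver-local output path and a hypothetical part-file name:

import os

def write_csv_with_header(df, out_dir):
    # If the DataFrame has no rows, write a directory containing just the
    # header line ourselves; otherwise let Spark write the CSV as usual.
    if df.rdd.isEmpty():
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, 'part-00000.csv'), 'w') as f:
            f.write(','.join(df.columns) + '\n')
    else:
        df.write.csv(out_dir, header=True)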


Source: https://stackoverflow.com/questions/56946600/in-spark-how-to-write-header-in-a-file-if-there-is-no-row-in-a-dataframe
