I want to write a header to a file even when the DataFrame has no rows. Currently, when I write an empty DataFrame to a file, the file is created but it has no header in it.
I am writing the DataFrame using these settings and this command:
Dataframe.repartition(1) \
.write \
.format("com.databricks.spark.csv") \
.option("ignoreLeadingWhiteSpace", False) \
.option("ignoreTrailingWhiteSpace", False) \
.option("header", "true") \
I want the header row in the file even if the DataFrame has no data rows.
If you just want a header-only file, you can use foldLeft to set each column to an empty string and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala. Most of the code should be reusable; you will just have to convert it to PySpark.
import org.apache.spark.sql.functions.lit

val path = "/user/test"
val newdf = df.columns.foldLeft(df) { (tempdf, cols) =>
  tempdf.withColumn(cols, lit(""))
}
Create a method for writing the header file:
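import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // Build the full path of the header file
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  try {
    // Join with commas so the header has no trailing separator
    writer.write(colNames.mkString(",") + "\n")
  } finally {
    writer.close()
  }
}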
Call it on your DataFrame:
createHeaderFile(path, newdf.columns)
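Since the question is about PySpark, here is a rough Python counterpart of createHeaderFile. It is only a sketch: it assumes the output directory is on a local (or locally mounted) filesystem rather than HDFS, and the function name create_header_file is made up for this example.

```python
import csv
import os


def create_header_file(header_dir, col_names, file_name="yourfileName.csv"):
    """Write a CSV file that contains only the header row and return its path.

    Assumption: header_dir is a path on the local filesystem, not HDFS.
    """
    os.makedirs(header_dir, exist_ok=True)
    full_path = os.path.join(header_dir, file_name)
    # newline="" lets the csv module control line endings itself
    with open(full_path, "w", newline="") as f:
        csv.writer(f).writerow(col_names)
    return full_path
```

With a DataFrame this could be called as create_header_file(path, df.columns), since df.columns is a plain list of column names.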
I had the same problem in PySpark. When the DataFrame was empty (e.g. after a .filter() transformation), the output was a single empty CSV without a header.
So I created a custom method that checks whether the output consists of a single empty CSV. If so, it adds only the header.
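import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    # Spark writes a directory of part files; collect the CSVs inside it
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # 'r+' allows reading and then writing ('a' cannot be read from)
        with open(csv_file, 'r+') as f:
            if f.readline() == '':
                # The file is empty, so write only the header row
                header = ','.join(columns)
                f.write(header + '\n')
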
# Create a dummy Dataframe
df = spark.createDataFrame([(1,2), (1, 4), (3, 2), (1, 4)], ("a", "b"))
# Filter in order to create an empty Dataframe
filtered_df = df.filter(df['a']>10)
# Write the empty df; the resulting CSV has no rows and no header
filtered_df.write.csv('output.csv', header='true')
# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
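A possible variation, in case the export produces more than one empty part file or a column name contains a comma. This is a sketch rather than a drop-in replacement: the name add_header_to_empty_csvs is hypothetical, it assumes the output directory is on the local filesystem, and it treats "empty" as "zero bytes". It writes the header into each empty file, which matches how Spark puts a header in every part file when the header option is on.

```python
import csv
import glob
import os


def add_header_to_empty_csvs(exported_path, columns):
    """Write the header row into every zero-byte CSV in exported_path.

    Returns the list of files that were changed.
    """
    changed = []
    for csv_file in sorted(glob.glob(os.path.join(exported_path, "*.csv"))):
        # Only touch files that contain nothing at all
        if os.path.getsize(csv_file) == 0:
            with open(csv_file, "w", newline="") as f:
                # csv.writer quotes names that contain commas or quotes
                csv.writer(f).writerow(columns)
            changed.append(csv_file)
    return changed
```

It is called the same way as the helper above, e.g. add_header_to_empty_csvs('output.csv', filtered_df.columns).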