Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?
If I run:
df.write.format('json').save(
Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient at all: coalesce(1) reduces the DataFrame to a single partition, so one worker has to fetch all the data and write it sequentially.
df.coalesce(1).write.format('json').save('myfile.json')
P.S. The resulting file is not a valid JSON document. It contains one JSON object per line.
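If you need a single valid JSON document, the one-object-per-line output can be converted afterwards. Here is a minimal sketch in plain Python, assuming the data fits in memory on one machine. The helper name json_lines_to_array and the sample text are made up for illustration and are not part of Spark.

```python
import json


def json_lines_to_array(lines):
    """Parse an iterable of JSON Lines text into a list of Python objects."""
    # Skip blank lines; every other line must hold exactly one JSON value.
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical sample mimicking the one-object-per-line layout Spark writes.
sample = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

records = json_lines_to_array(sample.splitlines())

# Serialising the list gives one valid JSON array instead of separate objects.
as_valid_json = json.dumps(records)
print(as_valid_json)
```

For a real output file you would pass the open file object instead of sample.splitlines(), since iterating a text file yields its lines.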