PySpark: spit out single file when writing instead of multiple part files

北荒 2021-01-02 12:50

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to JSON file?

If I run:

    df.write.format('json').save(


        
3 Answers
  • 2021-01-02 13:15

    df1.rdd.repartition(1).write.json('myfile.json')

    That would be nice, but it isn't available. See this related question: https://stackoverflow.com/a/33311467/2843520

  • 2021-01-02 13:25

    Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient, because it forces a single worker to fetch all the data and write it sequentially.

    df.coalesce(1).write.format('json').save('myfile.json')
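
Note that the path passed to save() becomes a directory containing one part-* file plus marker files such as _SUCCESS, not a plain file. Below is a minimal sketch of moving that part file to a single named file. It assumes the output was written to a local filesystem with exactly one partition. promote_part_file is a hypothetical helper, not part of Spark.

```python
import glob
import os
import shutil


def promote_part_file(spark_output_dir, target_path):
    """Move the single part-* file out of a Spark output directory.

    Hypothetical helper (not a Spark API). Assumes a local filesystem,
    a directory written with exactly one partition, and a target_path
    that lies outside spark_output_dir.
    """
    # The glob pattern skips hidden files such as .part-*.crc checksums.
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    # Remove what is left behind (_SUCCESS marker, checksum files).
    shutil.rmtree(spark_output_dir)
    return target_path
```

For example, write with df.coalesce(1).write.format('json').save('myfile_tmp') and then call promote_part_file('myfile_tmp', 'myfile.json'). The directory name myfile_tmp is only an illustration.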
    

    P.S. By the way, the resulting file is not a valid JSON document. It is a file with one JSON object per line.
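
To illustrate the difference: each line of such a file parses on its own, but the file as a whole does not. If a single JSON array is needed, a sketch along these lines could convert it, assuming the file fits in memory. jsonl_to_json_array and both path arguments are hypothetical names.

```python
import json


def jsonl_to_json_array(jsonl_path, json_path):
    """Rewrite a JSON Lines file as one JSON array.

    Hypothetical helper; assumes the whole file fits in memory.
    """
    with open(jsonl_path, encoding="utf-8") as src:
        # Each non-empty line is a complete JSON document on its own.
        records = [json.loads(line) for line in src if line.strip()]
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return records
```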

  • 2021-01-02 13:30

    This was a better solution for me.

    import json
    rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
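
Keep in mind that saveAsTextFile also writes a directory of part files, one per partition, so a single output file still requires a single partition. The sketch below uses plain Python without Spark to show what each output line would look like, assuming the RDD's elements are plain dicts. records and to_json_lines are hypothetical stand-ins for the RDD and the map step.

```python
import json

# Stand-in for the RDD's elements; assumed to be plain dicts.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def to_json_lines(rows):
    """Serialize each record to one JSON string, like map(json.dumps)."""
    return [json.dumps(row) for row in rows]


lines = to_json_lines(records)

# Tuples serialize as JSON arrays, so field names are lost. A DataFrame
# Row is a tuple subclass, so it would need converting to a dict first.
tuple_line = json.dumps((1, "a"))
```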
