Save a large Spark Dataframe as a single json file in S3

Asked by 星月不相逢 on 2021-02-01 20:39 · 3 answers · 1978 views

I'm trying to save a Spark DataFrame (more than 20 GB) as a single JSON file in Amazon S3. My code to save the dataframe looks like this:

dataframe.repartition(1)         
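For context (not from the thread): even with `repartition(1)`, Spark's JSON writer produces a *directory* containing a single `part-*` file, and that file is in JSON Lines format, one JSON object per line, rather than one big JSON array. A minimal stdlib sketch of what such a part file contains and how it parses (the sample records are illustrative):

```python
import json

# Spark's DataFrame.write.json emits JSON Lines: one independent JSON
# object per line, not a single JSON array.  A part file looks like:
part_file_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Each non-empty line is a complete JSON document on its own.
records = [json.loads(line) for line in part_file_text.splitlines() if line]
print(records)  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```

This matters for the question: a downstream consumer expecting one well-formed JSON document will still need to handle the JSON Lines layout, whatever the number of part files.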


        
3 Answers
  • 2021-02-01 21:01

    I would try separating the large dataframe into a series of smaller dataframes that you then append to the same output path in the target.

    df.write.mode('append').json(yourtargetpath)
    
  • 2021-02-01 21:13

    Try this (Scala; note this example writes to HDFS, not S3, so you would swap in your S3 path):

    import org.apache.spark.sql.SaveMode

    dataframe.write.format("json").mode(SaveMode.Append).save("hdfs://localhost:9000/sampletext.txt")
    
  • 2021-02-01 21:16

    I don't think s3a is a production-ready option in Spark. More to the point, the design is not sound: `repartition(1)` tells Spark to merge all partitions into a single one, which will be terrible for a 20 GB dataframe (one executor must hold and write everything). I would suggest convincing the downstream consumer to read the contents of a folder rather than a single file.
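If the consumer truly cannot be changed, the single-file merge can happen downstream instead of inside Spark, since part files are JSON Lines and concatenate cleanly. A stdlib-only sketch (the folder layout below mimics Spark's `part-*` output; the file names and records are illustrative):

```python
import glob
import os
import tempfile

# Build a fake Spark output folder holding two JSON Lines part files.
out_dir = tempfile.mkdtemp()
for i, row in enumerate(['{"id": 1}\n', '{"id": 2}\n']):
    with open(os.path.join(out_dir, f"part-{i:05d}.json"), "w") as f:
        f.write(row)

# Concatenate every part file, in sorted order, into one JSON Lines file.
merged_path = os.path.join(out_dir, "merged.json")
with open(merged_path, "w") as merged:
    for part in sorted(glob.glob(os.path.join(out_dir, "part-*.json"))):
        with open(part) as f:
            merged.write(f.read())

with open(merged_path) as f:
    print(f.read())  # the two records, one JSON object per line
```

Against S3 the same pattern would use an S3 client (or S3's multipart copy) in place of local file I/O, which keeps the expensive merge off the Spark driver.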
