How to write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

Submitted by 一笑奈何 on 2020-05-28 13:46:55

Question


I have a very large PySpark DataFrame, so I want to pre-process subsets of it and store them to HDFS. Later I want to read them all back and merge them together. Thanks.


Answer 1:


  • Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append')  # df is an existing DataFrame object


Some of the available format options are parquet, json, orc, etc. (csv is built in from Spark 2.0 onward; on Spark 1.6 it requires the external spark-csv package).
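
The mode argument controls what happens when the target path already exists. A quick sketch of the standard DataFrameWriter save modes:

    df.write.save('/target/path/', format='parquet', mode='append')     # add new files alongside existing data
    df.write.save('/target/path/', format='parquet', mode='overwrite')  # replace whatever is already at the path
    df.write.save('/target/path/', format='parquet', mode='error')      # raise an error if the path exists (default)
    df.write.save('/target/path/', format='parquet', mode='ignore')     # do nothing if the path exists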

  • Reading a DataFrame back from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')  # returns a DataFrame
    

The format method takes arguments such as parquet, csv, json, etc.
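
Putting both pieces together for the original use case (pre-process subsets, persist each one to HDFS, then read everything back and merge), a minimal sketch might look like this. The /source/path/ and /target/path/ layout, the group column, the subset keys, and the preprocess function are all hypothetical placeholders for your own logic:

    from functools import reduce
    from pyspark.sql import DataFrame, SQLContext

    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    big_df = sqlContext.read.format('parquet').load('/source/path/')  # hypothetical source data

    def preprocess(subset_df):
        return subset_df  # placeholder for your actual pre-processing

    keys = ['a', 'b', 'c']  # hypothetical subset keys
    for key in keys:
        # write each pre-processed subset to its own HDFS directory
        subset = preprocess(big_df.filter(big_df['group'] == key))
        subset.write.save('/target/path/%s/' % key, format='parquet', mode='overwrite')

    # read every subset back and merge them into a single DataFrame
    parts = [sqlContext.read.format('parquet').load('/target/path/%s/' % key) for key in keys]
    merged = reduce(DataFrame.unionAll, parts)  # Spark 1.6; on 2.x+ use DataFrame.union

Since all subsets share the same schema, unionAll simply concatenates their rows. Alternatively, writing every subset to one shared path with mode='append' (as in the answer above) lets a single load of that path return the merged result directly.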



Source: https://stackoverflow.com/questions/44290548/how-to-write-pyspark-dataframe-to-hdfs-and-then-how-to-read-it-back-into-datafra
