Question:
I have a very large PySpark DataFrame. I want to perform preprocessing on subsets of it and store each subset to HDFS, then later read them all back and merge them together. Thanks.
Answer 1:
Writing a DataFrame to HDFS (Spark 1.6):
df.write.save('/target/path/', format='parquet', mode='append')  # df is an existing DataFrame object
Some of the available format options are csv, parquet, json, etc.
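Since the question mentions writing several preprocessed subsets, a minimal sketch of that pattern could look like the following. Note that preprocess() and subsets are hypothetical placeholders for your own logic; the target path is likewise just an example.

# Hypothetical sketch: append each preprocessed subset to the same HDFS
# directory, so the parts accumulate into one parquet dataset.
for subset_df in subsets:                  # each subset_df is a PySpark DataFrame (placeholder)
    cleaned = preprocess(subset_df)        # your own preprocessing function (placeholder)
    cleaned.write.save('/target/path/',    # same target directory for every part
                       format='parquet',
                       mode='append')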
Reading a DataFrame from HDFS (Spark 1.6):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.format('parquet').load('/path/to/file')
The format method takes arguments such as parquet, csv, json, etc.
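To merge the subsets back, it may be enough to simply load the single directory you appended to. If the subsets were instead written to separate paths, one sketch is to read each path and combine them with DataFrame.unionAll (the Spark 1.6 name; later versions call it union). The paths list below is an assumption for illustration.

from functools import reduce
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Case 1: all subsets were appended to one directory, so loading it
# already returns the merged DataFrame.
merged = sqlContext.read.format('parquet').load('/target/path/')

# Case 2: subsets live in separate (hypothetical) paths; read each one
# and union them into a single DataFrame.
paths = ['/target/part1/', '/target/part2/']   # placeholder paths
parts = [sqlContext.read.parquet(p) for p in paths]
merged = reduce(lambda a, b: a.unionAll(b), parts)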
Source: https://stackoverflow.com/questions/44290548/how-to-write-pyspark-dataframe-to-hdfs-and-then-how-to-read-it-back-into-datafra