Write each row of a spark dataframe as a separate file

Submitted by 大兔子大兔子 on 2019-12-19 04:24:22

Question


I have a Spark DataFrame with a single column, where each row is a long string (actually an XML document). I want to go through the DataFrame and save the string from each row as a text file; the files can be named simply 1.xml, 2.xml, and so on.

I cannot seem to find any information or examples on how to do this, and I am just starting out with Spark and PySpark. Maybe I could map a function over the DataFrame, but that function would have to write a string to a text file, and I can't find how to do that.


Answer 1:


When saving a DataFrame with Spark, one file is created per partition. Hence, one way to get a single row per file is to first repartition the data into as many partitions as there are rows.

There is a library on GitHub for reading and writing XML files with Spark. However, the DataFrame needs a specific structure to produce valid XML. In this case, since everything is a string in a single column, the easiest way to save is probably as CSV.

The repartition and saving can be done as follows:

rows = df.count()
df.repartition(rows).write.csv('save-dir')
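Note that the CSV writer names its outputs `part-*` rather than 1.xml, 2.xml. A minimal post-processing sketch, assuming `save-dir` is on the local filesystem and each part file holds exactly one row, could rename the outputs afterwards:

```python
import glob
import os

def rename_parts(save_dir):
    """Rename Spark's part-* output files to 1.xml, 2.xml, ...

    Assumes one row per part file (i.e. the dataframe was repartitioned
    to one row per partition before writing).
    """
    parts = sorted(glob.glob(os.path.join(save_dir, "part-*")))
    renamed = []
    for i, part in enumerate(parts, start=1):
        target = os.path.join(save_dir, f"{i}.xml")
        os.rename(part, target)
        renamed.append(target)
    return renamed
```

Sorting the part names lexically preserves Spark's partition order, since the files are zero-padded (`part-00000`, `part-00001`, ...).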



Answer 2:


I would do it this way using Java and the Hadoop FileSystem API. You can write similar code in Python.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

List<String> strings = Arrays.asList("file1", "file2", "file3");
JavaSparkContext sc = new JavaSparkContext();       // uses the default SparkConf
List<String> collected = sc.parallelize(strings).collect();

FileSystem fs = FileSystem.get(new Configuration());
int i = 0;
for (String x : collected) {
    Path outputPath = new Path(++i + ".xml");       // 1.xml, 2.xml, ...
    try (OutputStream os = fs.create(outputPath)) { // auto-closes the stream
        os.write(x.getBytes(StandardCharsets.UTF_8));
    }
}
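The Python analogue mentioned above can stay on the driver side. A minimal sketch, assuming the dataframe is small enough to collect and that the string sits in the first column (`row[0]` is an assumption about the schema here):

```python
import os

def write_strings(strings, out_dir):
    """Write each string in `strings` as out_dir/1.xml, 2.xml, ..."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, s in enumerate(strings, start=1):
        path = os.path.join(out_dir, f"{i}.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write(s)
        paths.append(path)
    return paths

# With Spark, the strings would come from the dataframe, e.g.:
# strings = [row[0] for row in df.collect()]   # df is hypothetical here
```

Collecting to the driver avoids distributed-filesystem plumbing entirely, at the cost of only working when all rows fit in driver memory.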


Source: https://stackoverflow.com/questions/49883129/write-each-row-of-a-spark-dataframe-as-a-separate-file
