Change output file name in Spark Streaming

让人想犯罪 __ submitted on 2020-01-01 12:10:49

Question


I am running a Spark job which performs extremely well as far as the logic goes. However, the names of my output files are in the format part-00000, part-00001, etc. when I use saveAsTextFile to save the files to an S3 bucket. Is there a way to change the output filename?

Thank you.


Answer 1:


In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val hadoopConf = new Configuration()
hadoopConf.set("mapreduce.output.basename", "yourPrefix")

yourRDD.map(str => (null, str))
  .saveAsNewAPIHadoopFile(s"$outputPath/$dirName", classOf[NullWritable], classOf[String],
    classOf[TextOutputFormat[NullWritable, String]], hadoopConf)

Your files will be named like: yourPrefix-r-00001

In Hadoop and Spark, the output can contain more than one file, since there can be more than one reducer (Hadoop) or more than one partition (Spark). Each of those files therefore needs a unique name, which is why the sequence number at the end of the filename cannot be overridden.
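The naming scheme itself is easy to reproduce. As a rough sketch (plain Scala, no Spark required; the helper name partFileName is made up for illustration), this is how a Hadoop-style part file name is put together:

```scala
// Hadoop-style output names: <basename>-<type>-<5-digit sequence>.
// "r" marks reducer output; the sequence is the partition/reducer index,
// zero-padded so that names sort correctly and never collide across partitions.
def partFileName(basename: String, partition: Int): String =
  f"$basename-r-$partition%05d"

// e.g. partFileName("yourPrefix", 1) yields "yourPrefix-r-00001"
```

Changing the basename only swaps the prefix; the per-partition sequence number always remains.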

But if you want more control over your filenames, you can extend TextOutputFormat or FileOutputFormat and override the getUniqueFile method.
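As an illustrative sketch of that approach (the class name and the "myCustomName" prefix are hypothetical, not from the answer): in the new MapReduce API, getUniqueFile is actually static and cannot be overridden, so the practical hook is the instance method getDefaultWorkFile.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, TextOutputFormat}

// Hypothetical subclass (untested sketch): builds the output file name itself.
// The partition id is kept in the name so files from different partitions
// never collide.
class CustomNameTextOutputFormat extends TextOutputFormat[NullWritable, String] {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    val committer = getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    val partition = context.getTaskAttemptID.getTaskID.getId
    new Path(committer.getWorkPath, f"myCustomName-$partition%05d$extension")
  }
}

// Usage sketch: pass it in place of the stock TextOutputFormat:
// yourRDD.map(str => (null, str))
//   .saveAsNewAPIHadoopFile(outputPath, classOf[NullWritable], classOf[String],
//     classOf[CustomNameTextOutputFormat], hadoopConf)
```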




Answer 2:


[Solution in Java]

Let's say you have:

JavaRDD<Text> rows;

And you want to write it to files named like customPrefix-r-00000.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import scala.Tuple2;

Configuration hadoopConf = new Configuration();
hadoopConf.set("mapreduce.output.basename", "customPrefix");

rows.mapToPair(row -> new Tuple2<>(null, row))
    .saveAsNewAPIHadoopFile(outputPath, NullWritable.class, Text.class,
        TextOutputFormat.class, hadoopConf);

Tada!!



Source: https://stackoverflow.com/questions/37972381/change-output-file-name-in-spark-streaming
