Question
I am running a Spark job which performs extremely well as far as the logic goes. However, the names of my output files are in the format part-00000, part-00001, etc. when I use saveAsTextFile to save the files in an S3 bucket. Is there a way to change the output filename?
Thank you.
Answer 1:
In Spark you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val hadoopConf = new Configuration()
hadoopConf.set("mapreduce.output.basename", "yourPrefix")  // replaces the default "part" prefix
yourRDD.map(str => (null, str))
  .saveAsNewAPIHadoopFile(s"$outputPath/$dirName", classOf[NullWritable], classOf[String],
    classOf[TextOutputFormat[NullWritable, String]], hadoopConf)
Your files will be named like: yourPrefix-r-00001
In Hadoop and Spark you can have more than one file in the output, since you can have more than one reducer (Hadoop) or more than one partition (Spark). Each file therefore needs a guaranteed-unique name, which is why it is not possible to override the sequence number at the end of the filename.
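To illustrate (a minimal sketch assuming an existing SparkContext sc; the bucket path is hypothetical), the number of output files tracks the number of partitions:

val rdd = sc.parallelize(Seq("a", "b", "c"), 3)  // 3 partitions
rdd.saveAsTextFile("s3://your-bucket/example")   // writes part-00000, part-00001, part-00002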
But if you want more control over your filenames, you can extend TextOutputFormat or FileOutputFormat and override the getUniqueFile method, as sketched below.
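Here is a minimal sketch of that idea in Scala (the class name and the myfile- pattern are hypothetical). Note that in the new mapreduce API getUniqueFile is static and cannot be overridden directly, so this sketch overrides getDefaultWorkFile, which is the method that calls getUniqueFile:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, TextOutputFormat}

class CustomNameTextOutputFormat[K, V] extends TextOutputFormat[K, V] {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    val committer = getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    // Keep the partition number in the name so concurrent writers never collide.
    val partition = context.getTaskAttemptID.getTaskID.getId
    new Path(committer.getWorkPath, f"myfile-$partition%05d$extension")
  }
}

You would then pass classOf[CustomNameTextOutputFormat[NullWritable, String]] to saveAsNewAPIHadoopFile in place of the TextOutputFormat above.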
Answer 2:
[Solution in Java]
Let's say you have:
JavaRDD<Text> rows;
And you want to write it to files like customPrefix-r-00000:
Configuration hadoopConf = new Configuration();
hadoopConf.set("mapreduce.output.basename", "customPrefix");
rows.mapToPair(row -> new Tuple2<>(null, row))
    .saveAsNewAPIHadoopFile(outputPath, NullWritable.class, Text.class,
        TextOutputFormat.class, hadoopConf);
Tada!!
Source: https://stackoverflow.com/questions/37972381/change-output-file-name-in-spark-streaming