How can you write to multiple outputs dependent on the key using Spark in a single job?
Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job
I had a similar use case where I split an input file on Hadoop HDFS into multiple files based on a key (one file per key). Here is my Scala code for Spark:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object processGroup extends Serializable {
  def apply(groupName: String, records: Iterable[String]): Unit = {
    // Get the FileSystem handle on the executor; a handle created on the
    // driver is not serializable and cannot be shipped inside the closure.
    val fs = FileSystem.get(new Configuration())
    val outFileStream = fs.create(new Path("/output_dir/" + groupName))
    for (line <- records) {
      // Write plain UTF-8 text (writeUTF would prepend a 2-byte length to each record)
      outFileStream.write((line + "\n").getBytes("UTF-8"))
    }
    outFileStream.close()
  }
}

val infile = sc.textFile("input_file")
val dateGrouped = infile.groupBy(_.split(",")(0))
dateGrouped.foreach(x => processGroup(x._1, x._2))
I have grouped the records by key, and the values for each key are written to a separate file.
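If you are on a newer Spark release, the DataFrame writer's partitionBy can produce the same per-key split without hand-rolling the HDFS writes. A minimal sketch, assuming Spark 2.x with a SparkSession named spark; the column names "key" and "value" and the paths are illustrative, not from the original answer:

import spark.implicits._

// Read the input as a Dataset[String] and pair each line with its key (first CSV field)
val lines = spark.read.textFile("input_file")
val keyed = lines.map(line => (line.split(",")(0), line)).toDF("key", "value")

// One sub-directory per key value: /output_dir/key=<k>/part-*.txt
keyed.write
  .partitionBy("key")
  .text("/output_dir")

Unlike the groupBy approach above, this does not require all values for a key to fit in memory on a single executor, though each key ends up in its own sub-directory rather than a single named file.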