How can you write to multiple outputs, dependent on the key, using Spark in a single job?
Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job
I have a similar need and found a way. But it has one drawback (which is not a problem for my case): you need to re-partition your data with one partition per output file.
To partition this way, you generally need to know beforehand how many files the job will output, plus a function that maps each key to its partition.
First, let's create our MultipleTextOutputFormat-based class:
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
  // Use the key itself as the output file name
  override def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
    key.toString
  }
  // Suppress the key so only the value is written to the file
  override protected def generateActualKey(key: T, value: V) = {
    null
  }
}
With this class, Spark will take a key from each partition (the first/last one, I guess) and name the file with it, so it's not good to mix multiple keys in the same partition.
For your example, you will need a custom partitioner. This one will do the job:
import org.apache.spark.Partitioner

class IdentityIntPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions = maxKey

  // Identity mapping: integer key N goes to partition N.
  // Keys that are not Ints smaller than maxKey will throw a MatchError.
  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
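To see the mapping in action, here's a quick check (a hypothetical spark-shell session, not part of the job itself):

val p = new IdentityIntPartitioner(10)
p.getPartition(7)   // == 7: the key itself is the partition index
p.getPartition(2)   // == 2
// p.getPartition(42) would throw a MatchError, since 42 >= maxKey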
Now let's put everything together:
val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"), (7, "d"), (7, "e")))

// You need to know the max number of partitions (files) beforehand.
// In this case we want one partition per key and we have 3 keys,
// with the biggest key being 7, so 10 will be large enough.
val partitioner = new IdentityIntPartitioner(10)
val prefix = "hdfs://.../prefix"

val partitionedRDD = rdd.partitionBy(partitioner)
partitionedRDD.saveAsHadoopFile(prefix,
  classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
This will generate 3 files under prefix (named 1, 2 and 7), processing everything in one pass.
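If you want to double-check that no partition mixes keys before relying on the output (my addition; your job doesn't strictly need it), a minimal sketch:

// Count partitions that hold more than one distinct key; should be 0 here
val mixed = partitionedRDD
  .mapPartitions(it => Iterator(it.map(_._1).toSet.size))
  .filter(_ > 1)
  .count()
require(mixed == 0, "some partition contains more than one key")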
As you can see, you need some knowledge about your keys to be able to use this solution.
For me it was easier because I needed one output file per key hash and the number of files was under my control, so I could use the stock HashPartitioner to do the trick.
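For completeness, here's a sketch of that variant. The names (HashBucketOutput, NumBuckets, bucket) are mine, not from any library; the bucket function just mirrors the non-negative modulo that HashPartitioner applies internally:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

object HashBucketOutput {
  // Hadoop instantiates output formats reflectively (no-arg constructor),
  // so the bucket count lives in a constant shared with the partitioner
  val NumBuckets = 16

  // Non-negative modulo, matching HashPartitioner's bucketing
  def bucket(key: Any): Int = {
    val raw = key.hashCode % NumBuckets
    if (raw < 0) raw + NumBuckets else raw
  }
}

class HashBucketOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
  // Name each file after the key's hash bucket instead of the key itself
  override def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    HashBucketOutput.bucket(key).toString
  override protected def generateActualKey(key: T, value: V) = null
}

// HashPartitioner(NumBuckets) sends every key of a bucket to the same
// partition, so each partition writes exactly one file named after its bucket
rdd.partitionBy(new HashPartitioner(HashBucketOutput.NumBuckets))
  .saveAsHadoopFile(prefix, classOf[Integer], classOf[String],
    classOf[HashBucketOutput[Integer, String]])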