Write to multiple outputs by key in Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  醉酒成梦
    2020-11-22 06:10

    I have a similar need and found a way. But it has one drawback (which is not a problem in my case): you need to re-partition your data with one partition per output file.

    To partition in this way, you generally need to know beforehand how many files the job will output and to find a function that maps each key to its partition.

    First, let's create our MultipleTextOutputFormat-based class:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    
    class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
      // Name the output file after the key itself
      override def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
        key.toString
      }
      // Return null so the key is not written into the file, only the value
      override protected def generateActualKey(key: T, value: V) = {
        null
      }
    }
    

    With this class, Spark will take a key from each partition (the first or the last one, I guess) and name the file after it, so it is not a good idea to mix multiple keys in the same partition.

    For your example, you will require a custom partitioner. This will do the job:

    import org.apache.spark.Partitioner
    
    class IdentityIntPartitioner(maxKey: Int) extends Partitioner {
      def numPartitions = maxKey
    
      // Send each integer key to the partition with the same number.
      // Note: non-Int keys or keys >= maxKey fall through and throw a MatchError.
      def getPartition(key: Any): Int = key match {
        case i: Int if i < maxKey => i
      }
    }
    

    Now let's put everything together:

    val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"), (7, "d"), (7, "e")))
    
    // You need to know the max number of partitions (files) beforehand
    // In this case we want one partition per key and we have 3 keys,
    // with the biggest key being 7, so 10 will be large enough
    val partitioner = new IdentityIntPartitioner(10)
    
    val prefix = "hdfs://.../prefix"
    
    val partitionedRDD = rdd.partitionBy(partitioner)
    
    partitionedRDD.saveAsHadoopFile(prefix,
        classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
    

    This will generate 3 files under prefix (named 1, 2 and 7), processing everything in one pass.
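
    For the sample data above, the output should look roughly like this sketch (generateActualKey returns null, so each file holds only the values, one per line):
    
    prefix/1    a
                b
    prefix/2    c
    prefix/7    d
                e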

    As you can see, you need some knowledge about your keys to be able to use this solution.

    For me it was easier because I needed one output file for each key hash and the number of files was under my control, so I could use the stock HashPartitioner to do the trick.
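
    A rough sketch of that HashPartitioner variant (assumptions: the bucketing step, the numBuckets value and the output path are mine, not part of the answer above; the idea is to re-key each record to a non-negative hash bucket so that each bucket lands in exactly one partition and therefore one file):
    
    import org.apache.spark.HashPartitioner
    
    // Illustrative only: reuses the rdd and KeyBasedOutput defined above.
    // Re-key each record by its hash bucket, then let the stock HashPartitioner
    // put each bucket into its own partition, so KeyBasedOutput names one file
    // per bucket (0 .. numBuckets-1).
    val numBuckets = 16
    
    val byBucket = rdd
      .map { case (k, v) => (((k.hashCode % numBuckets) + numBuckets) % numBuckets, v) }
      .partitionBy(new HashPartitioner(numBuckets))
    
    byBucket.saveAsHadoopFile("hdfs://.../hashed-prefix",
      classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
    
    This works because a HashPartitioner with numBuckets partitions sends an integer key i (with 0 <= i < numBuckets) to partition i, so each output file is still written by exactly one partition.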
