Write to multiple outputs by key in Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  醉酒成梦
    2020-11-22 06:10

    I have a similar need and found a way. But it has one drawback (which is not a problem in my case): you need to re-partition your data with one partition per output file.

    To partition in this way, you generally need to know beforehand how many files the job will output and to find a function that maps each key to its partition.

    First, let's create our MultipleTextOutputFormat-based class:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    
    class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
      // Name the output file after the key itself
      override def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
        key.toString
      }
      // Return null so the key is not written into the file, only the value
      override protected def generateActualKey(key: T, value: V) = {
        null
      }
    }
    

    With this class, Spark will take a key from each partition (the first or the last one, I guess) and name the file after it, so it is not a good idea to mix multiple keys in the same partition.

    For your example, you will require a custom partitioner. This will do the job:

    import org.apache.spark.Partitioner
    
    class IdentityIntPartitioner(maxKey: Int) extends Partitioner {
      def numPartitions = maxKey
    
      // Send each integer key to the partition with the same number.
      // Note: non-Int keys or keys >= maxKey fall through and throw a MatchError.
      def getPartition(key: Any): Int = key match {
        case i: Int if i < maxKey => i
      }
    }
    

    Now let's put everything together:

    val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"), (7, "d"), (7, "e")))
    
    // You need to know the max number of partitions (files) beforehand
    // In this case we want one partition per key and we have 3 keys,
    // with the biggest key being 7, so 10 will be large enough
    val partitioner = new IdentityIntPartitioner(10)
    
    val prefix = "hdfs://.../prefix"
    
    val partitionedRDD = rdd.partitionBy(partitioner)
    
    partitionedRDD.saveAsHadoopFile(prefix,
        classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
    

    This will generate 3 files under prefix (named 1, 2 and 7), processing everything in one pass.
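
    For the sample data above, the output should look roughly like this sketch (generateActualKey returns null, so each file holds only the values, one per line):
    
    prefix/1    a
                b
    prefix/2    c
    prefix/7    d
                e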

    As you can see, you need some knowledge about your keys to be able to use this solution.

    For me it was easier because I needed one output file for each key hash and the number of files was under my control, so I could use the stock HashPartitioner to do the trick.
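
    A rough sketch of that HashPartitioner variant (assumptions: the bucketing step, the numBuckets value and the output path are mine, not part of the answer above; the idea is to re-key each record to a non-negative hash bucket so that each bucket lands in exactly one partition and therefore one file):
    
    import org.apache.spark.HashPartitioner
    
    // Illustrative only: reuses the rdd and KeyBasedOutput defined above.
    // Re-key each record by its hash bucket, then let the stock HashPartitioner
    // put each bucket into its own partition, so KeyBasedOutput names one file
    // per bucket (0 .. numBuckets-1).
    val numBuckets = 16
    
    val byBucket = rdd
      .map { case (k, v) => (((k.hashCode % numBuckets) + numBuckets) % numBuckets, v) }
      .partitionBy(new HashPartitioner(numBuckets))
    
    byBucket.saveAsHadoopFile("hdfs://.../hashed-prefix",
      classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
    
    This works because a HashPartitioner with numBuckets partitions sends an integer key i (with 0 <= i < numBuckets) to partition i, so each output file is still written by exactly one partition.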
