Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs dependent on the key using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  • 2020-11-22 06:10

    I have a similar need and found a way. But it has one drawback (which is not a problem for my case): you need to re-partition your data with one partition per output file.

    Partitioning this way generally requires knowing beforehand how many files the job will output, and a function that maps each key to a partition.

    First let's create our MultipleTextOutputFormat-based class:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    
    class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
      // Name each output file after the key.
      override def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
        key.toString
      }
      // Return null so that only the value is written to the file, not the key.
      override protected def generateActualKey(key: T, value: V) = {
        null
      }
    }
    

    With this class, Spark will take a key from the partition (the first or last one, I guess) and name the file after it, so it's not a good idea to mix multiple keys in the same partition.

    For your example, you will require a custom partitioner. This will do the job:

    import org.apache.spark.Partitioner
    
    // Sends each integer key to the partition with the same number, so every
    // partition ends up holding exactly one key.
    class IdentityIntPartitioner(maxKey: Int) extends Partitioner {
      def numPartitions = maxKey
    
      def getPartition(key: Any): Int = key match {
        case i: Int if i < maxKey => i
        case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
      }
    }
    

    Now let's put everything together:

    val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"), (7, "d"), (7, "e")))
    
    // You need to know the max number of partitions (files) beforehand
    // In this case we want one partition per key and we have 3 keys,
    // with the biggest key being 7, so 10 will be large enough
    val partitioner = new IdentityIntPartitioner(10)
    
    val prefix = "hdfs://.../prefix"
    
    val partitionedRDD = rdd.partitionBy(partitioner)
    
    partitionedRDD.saveAsHadoopFile(prefix,
        classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])
    

    This will generate 3 files under prefix (named 1, 2 and 7), processing everything in one pass.

    As you can see, you need some knowledge about your keys to be able to use this solution.

    For me it was easier because I needed one output file for each key hash and the number of files was under my control, so I could use the stock HashPartitioner to do the trick.
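
    As a rough sketch of that hash-based variant (not part of the original answer): assuming one output file per hash bucket is acceptable, you can reuse the rdd, prefix and KeyBasedOutput definitions from above; numFiles and the bucketing map are hypothetical additions for illustration.

    import org.apache.spark.HashPartitioner
    
    // Replace each key by a non-negative hash bucket, so every partition holds a
    // single distinct key (the bucket id), which then names the output file.
    val numFiles = 16  // assumed number of output files
    val bucketed = rdd.map { case (k, v) => (((k.hashCode % numFiles) + numFiles) % numFiles, v) }
    
    bucketed.partitionBy(new HashPartitioner(numFiles))
      .saveAsHadoopFile(prefix, classOf[Integer], classOf[String],
        classOf[KeyBasedOutput[Integer, String]])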

  • 2020-11-22 06:11

    saveAsTextFile() and saveAsHadoopFile(...) are implemented on top of the RDD data, specifically by the method PairRDDFunctions.saveAsHadoopDataset, which takes the data from the PairRDD it is executed on. I see two possible options.
    
    If your data is relatively small in size, you could save some implementation time by grouping the RDD by key, creating a new RDD from each collection, and using that RDD to write the data. Something like this:

    val byKey = dataRDD.groupByKey().collect()
    val rddByKey = byKey.map { case (k, v) => k -> sc.makeRDD(v.toSeq) }
    rddByKey.foreach { case (k, rdd) => rdd.saveAsTextFile(prefix + k) }
    

    Note that this will not work for large datasets, because the materialization of the iterator at v.toSeq might not fit in memory.

    The other option I see, and actually the one I'd recommend in this case, is to roll your own by calling the Hadoop/HDFS API directly.
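
    A minimal sketch of what that could look like (not from the original answer), assuming dataRDD is a pair RDD and reusing the prefix placeholder from the snippet above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    // Group the values by key and write each group to its own HDFS file from the workers.
    // The FileSystem handle is created inside the partition because it is not serializable.
    dataRDD.groupByKey().foreachPartition { groups =>
      val fs = FileSystem.get(new Configuration())
      groups.foreach { case (k, values) =>
        val out = fs.create(new Path(prefix + k))  // one file per key
        try values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
        finally out.close()
      }
    }

    Unlike the groupByKey-and-collect approach, this keeps the writing on the executors, although each key's values still have to fit in memory on a single executor.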

    Here's a discussion I started while researching this question: How to create RDDs from another RDD?

  • 2020-11-22 06:12

    If you potentially have many values for a given key, I think the scalable solution is to write out one file per key per partition. Unfortunately there is no built-in support for this in Spark, but we can whip something up.

    sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
      .mapPartitionsWithIndex { (p, it) =>
        val outputs = new MultiWriter(p.toString)
        for ((k, v) <- it) {
          outputs.write(k.toString, v)
        }
        outputs.close
        Nil.iterator
      }
      .foreach((x: Nothing) => ()) // To trigger the job.
    
    // This one is Local, but you could write one for HDFS
    class MultiWriter(suffix: String) {
      private val writers = collection.mutable.Map[String, java.io.PrintWriter]()
      def write(key: String, value: Any) = {
        if (!writers.contains(key)) {
          val f = new java.io.File("output/" + key + "/" + suffix)
          f.getParentFile.mkdirs
          writers(key) = new java.io.PrintWriter(f)
        }
        writers(key).println(value)
      }
      def close = writers.values.foreach(_.close)
    }
    

    (Replace PrintWriter with your choice of distributed filesystem operation.)

    This makes a single pass over the RDD and performs no shuffle. It gives you one directory per key, with a number of files inside each.
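
    For example, a hypothetical HDFS-backed replacement might look like this (a sketch only; the class name, output base path and UTF-8 encoding are assumptions, not part of the original answer):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
    
    // Plays the same role as MultiWriter above, but writes to HDFS instead of local disk.
    class HdfsMultiWriter(suffix: String) {
      private val fs = FileSystem.get(new Configuration())
      private val writers = collection.mutable.Map[String, FSDataOutputStream]()
      def write(key: String, value: Any): Unit = {
        // fs.create makes missing parent directories, so output/<key>/ need not exist yet.
        val out = writers.getOrElseUpdate(key, fs.create(new Path("output/" + key + "/" + suffix)))
        out.write((value.toString + "\n").getBytes("UTF-8"))
      }
      def close(): Unit = writers.values.foreach(_.close())
    }

    It would be constructed inside mapPartitionsWithIndex exactly like MultiWriter above, so the single-pass, no-shuffle property is preserved.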

  • 2020-11-22 06:13

    I had a similar use case where I split the input file on Hadoop HDFS into multiple files based on a key (one file per key). Here is my Scala code for Spark:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    object processGroup extends Serializable {
      def apply(groupName: String, records: Iterable[String]): Unit = {
        // Create the FileSystem handle here, on the executor; it is not
        // serializable and cannot be captured from the driver.
        val fs = FileSystem.get(new Configuration())
        val outFileStream = fs.create(new Path("/output_dir/" + groupName))
        for (line <- records) {
          // Write plain UTF-8 text; writeUTF would prepend a 2-byte length header.
          outFileStream.write((line + "\n").getBytes("UTF-8"))
        }
        outFileStream.close()
      }
    }
    
    val infile = sc.textFile("input_file")
    val dateGrouped = infile.groupBy(_.split(",")(0))
    dateGrouped.foreach((x) => processGroup(x._1, x._2))
    

    I have grouped the records based on the key. The values for each key are written to a separate file.
