Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, dependent on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  孤独总比滥情好
    2020-11-22 06:12

    If you potentially have many values for a given key, I think the scalable solution is to write out one file per key per partition. Unfortunately there is no built-in support for this in Spark, but we can whip something up.

    sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
      .mapPartitionsWithIndex { (p, it) =>
        // One MultiWriter per partition; the partition index becomes the file suffix.
        val outputs = new MultiWriter(p.toString)
        for ((k, v) <- it) {
          outputs.write(k.toString, v)
        }
        outputs.close
        Nil.iterator // Nothing to return; we only wrote side-effect output.
      }
      .foreach((x: Nothing) => ()) // To trigger the job.
    
    // This one is Local, but you could write one for HDFS
    class MultiWriter(suffix: String) {
      private val writers = collection.mutable.Map[String, java.io.PrintWriter]()
      def write(key: String, value: Any) = {
        if (!writers.contains(key)) {
          val f = new java.io.File("output/" + key + "/" + suffix)
          f.getParentFile.mkdirs
          writers(key) = new java.io.PrintWriter(f)
        }
        writers(key).println(value)
      }
      def close = writers.values.foreach(_.close)
    }
    

    (Replace PrintWriter with your choice of distributed filesystem operation.)
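
    For example, here is a minimal sketch of the same idea writing through the Hadoop FileSystem API instead of a local PrintWriter. The class name HdfsMultiWriter and the "output/" base path are illustrative choices, not part of the original snippet, and this assumes the Hadoop client libraries are on the classpath:

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    // Illustrative HDFS variant of MultiWriter: one open stream per key seen in this partition.
    class HdfsMultiWriter(suffix: String) {
      private val fs = FileSystem.get(new Configuration())
      private val writers = collection.mutable.Map[String, PrintWriter]()
      def write(key: String, value: Any): Unit = {
        if (!writers.contains(key)) {
          // fs.create makes the parent directories and returns an output stream we can wrap.
          val out = fs.create(new Path("output/" + key + "/" + suffix))
          writers(key) = new PrintWriter(out)
        }
        writers(key).println(value)
      }
      def close(): Unit = writers.values.foreach(_.close())
    }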

    This makes a single pass over the RDD and performs no shuffle. It gives you one directory per key, with a number of files inside each.
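
    Since each key gets its own directory, a downstream job can read one key's data back directly. A small sketch, assuming the sample data and the "output/" path from the snippet above:

    // Read everything written for key 1 back into an RDD.
    val forKey1 = sc.textFile("output/1/*")
    forKey1.collect().foreach(println) // prints "a" and "b" for the sample data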
