Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
    北恋 2020-11-22 05:52

    If you use Spark 1.4+, this has become much, much easier thanks to the DataFrame API. (DataFrames were introduced in Spark 1.3, but partitionBy(), which we need, was introduced in 1.4.)

    If you're starting out with an RDD, you'll first need to convert it to a DataFrame:

    // toDF on an RDD needs the SQL implicits in scope (imported automatically in spark-shell):
    // import sqlContext.implicits._
    val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, "charlie")))
    val people_df = people_rdd.toDF("number", "name")
    

    In Python, this same code is:

    people_rdd = sc.parallelize([(1, "alice"), (1, "bob"), (2, "charlie")])
    people_df = people_rdd.toDF(["number", "name"])
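
    If you're not starting from an RDD at all, you can also build the DataFrame directly from a local collection. A minimal sketch, assuming a sqlContext is in scope (as it is in the PySpark shell):

    # create the DataFrame directly, skipping the RDD step
    people_df = sqlContext.createDataFrame(
        [(1, "alice"), (1, "bob"), (2, "charlie")],
        ["number", "name"]
    )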
    

    Once you have a DataFrame, writing to multiple outputs based on a particular key is simple. What's more -- and this is the beauty of the DataFrame API -- the code is pretty much the same across Python, Scala, Java and R:

    people_df.write.partitionBy("number").text("people")
    

    And you can easily use other output formats if you want:

    people_df.write.partitionBy("number").json("people-json")
    people_df.write.partitionBy("number").parquet("people-parquet")
    

    In each of these examples, Spark will create a subdirectory for each of the keys that we've partitioned the DataFrame on:

    people/
      _SUCCESS
      number=1/
        part-abcd
        part-efgh
      number=2/
        part-abcd
        part-efgh
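
    The part-* files come from the individual write tasks. Reading the output back restores the partition column through partition discovery. A minimal sketch, assuming the same sqlContext and the parquet output written above:

    # read the partitioned output back; "number" comes back as a regular column
    reloaded = sqlContext.read.parquet("people-parquet")
    reloaded.printSchema()
    reloaded.filter(reloaded["number"] == 1).show()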
    
