Combine the value part of a Tuple2 (a map) into a single map, grouping by the key of the Tuple2

Submitted on 2021-01-28 05:45:13

Question


I am doing this in Scala and Spark.

I have a Dataset of Tuple2, i.e. Dataset[(String, Map[String, String])].

Below is an example of the values in the Dataset.

(A, {1->100, 2->200, 3->100})
(B, {1->400, 4->300, 5->900})
(C, {6->100, 4->200, 5->100})
(B, {1->500, 9->300, 11->900})
(C, {7->100, 8->200, 5->800})

If you notice, the key (the first element of the Tuple2) can be repeated. Also, the maps belonging to the same tuple key can contain duplicate keys among themselves (in the second part of the Tuple2).

I want to create a final Dataset[(String, Map[String, String])]. The output for the example above is shown below. Note that when a map key is duplicated, the value from the last occurrence is retained (check B and C), and the earlier value for that same key under B and C is dropped.

(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
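For reference, the intended merge semantics can be sketched in plain Scala on an in-memory collection (no Spark; the Int keys/values mirror the sample data used in the answers below). Merging maps left-to-right with `++` makes the later map's value win on duplicate keys:

```scala
val data = Seq(
  ("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
  ("C", Map(7 -> 100, 8 -> 200, 5 -> 800)))

// Group by the tuple key and merge each group's maps left-to-right;
// `++` is right-biased, so later values overwrite earlier ones.
val merged: Map[String, Map[Int, Int]] =
  data.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2).reduce(_ ++ _)
  }
```

`groupBy` on a `Seq` keeps elements in encounter order within each group, so the reduction sees the maps in their original order.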

Please let me know if any clarification is required.


Answer 1:


Using an RDD:

val rdd = sc.parallelize(
    Seq(("A", Map(1->100, 2->200, 3->100)),
        ("B", Map(1->400, 4->300, 5->900)),
        ("C", Map(6->100, 4->200, 5->100)),
        ("B", Map(1->500, 9->300, 11->900)),
        ("C", Map(7->100, 8->200, 5->800)))
)

rdd.reduceByKey((a, b) => a ++ b).collect()

// Array((A,Map(1 -> 100, 2 -> 200, 3 -> 100)), (B,Map(5 -> 900, 1 -> 500, 9 -> 300, 11 -> 900, 4 -> 300)), (C,Map(5 -> 800, 6 -> 100, 7 -> 100, 8 -> 200, 4 -> 200)))
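This works because `++` on Scala maps is right-biased: when both operands contain a key, the right operand's value is kept. A minimal standalone illustration with the B maps from the example:

```scala
val left  = Map(1 -> 400, 4 -> 300, 5 -> 900)
val right = Map(1 -> 500, 9 -> 300, 11 -> 900)

// Key 1 appears in both maps; `++` keeps the value from `right`.
val combined = left ++ right
// combined: Map(1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900)
```

So `reduceByKey(_ ++ _)` keeps the value from the later map for each duplicate key, matching the desired output.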

And using a DataFrame:

val df = spark.createDataFrame(
    Seq(("A", Map(1->100, 2->200, 3->100)),
        ("B", Map(1->400, 4->300, 5->900)),
        ("C", Map(6->100, 4->200, 5->100)),
        ("B", Map(1->500, 9->300, 11->900)),
        ("C", Map(7->100, 8->200, 5->800)))
).toDF("key", "map")

// LAST_WIN requires Spark 3.0+; with the default policy (EXCEPTION),
// map_from_entries fails on duplicate keys
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("map", map_entries($"map"))
  .groupBy("key").agg(collect_list($"map").alias("map"))
  .withColumn("map", flatten($"map"))
  .withColumn("map", map_from_entries($"map")).show(false)

+---+---------------------------------------------------+
|key|map                                                |
+---+---------------------------------------------------+
|B  |[1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900]|
|C  |[6 -> 100, 4 -> 200, 5 -> 800, 7 -> 100, 8 -> 200] |
|A  |[1 -> 100, 2 -> 200, 3 -> 100]                     |
+---+---------------------------------------------------+



Answer 2:


Using dataframes:

val df = Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
    ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
    ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
    ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
    ("C", Map(7 -> 100, 8 -> 200, 5 -> 800))).toDF("a", "b")

val df2 = df.select('a, explode('b))
    .groupBy("a", "key")           //remove the duplicate keys
    .agg(last('value).as("value")) //and take the last value for duplicate keys
    .groupBy("a")
    .agg(map_from_arrays(collect_list('key), collect_list('value)).as("b"))
df2.show()

prints

+---+---------------------------------------------------+
|a  |b                                                  |
+---+---------------------------------------------------+
|B  |[5 -> 900, 9 -> 300, 1 -> 500, 4 -> 300, 11 -> 900]|
|C  |[6 -> 100, 8 -> 200, 7 -> 100, 4 -> 200, 5 -> 800] |
|A  |[3 -> 100, 1 -> 100, 2 -> 200]                     |
+---+---------------------------------------------------+
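The strategy here can also be sketched in plain Scala: flatten each (group, map) pair into (group, key, value) rows (mirroring `explode('b)`), then rebuild one map per group, where `toMap` keeps the last occurrence of each key (mirroring `agg(last('value))`). Only the two B rows are used, to keep the sketch short:

```scala
// Flatten each (group, map) pair into (group, key, value) rows.
val rows = Seq(
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900))
).flatMap { case (g, m) => m.map { case (k, v) => (g, k, v) } }

// Rebuild one map per group; toMap keeps the last value
// seen for each key, so later rows win on duplicates.
val rebuilt: Map[String, Map[Int, Int]] =
  rows.groupBy(_._1).map { case (g, rs) =>
    g -> rs.map(r => (r._2, r._3)).toMap
  }
```

Unlike the local sketch, the DataFrame version gives no ordering guarantee to `last` within a group unless the data is sorted, so the "last occurrence wins" behavior depends on row order being preserved.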

Since this approach involves two aggregations, the RDD-based answer is likely to be faster.



Source: https://stackoverflow.com/questions/63647473/combine-value-part-of-tuple2-which-is-a-map-into-single-map-grouping-by-the-key
