Question
I am doing this in Scala and Spark. I have a Dataset of Tuple2, i.e. Dataset[(String, Map[String, String])]. Below is an example of the values in the Dataset:
(A, {1->100, 2->200, 3->100})
(B, {1->400, 4->300, 5->900})
(C, {6->100, 4->200, 5->100})
(B, {1->500, 9->300, 11->900})
(C, {7->100, 8->200, 5->800})
If you notice, the key (the first element of the Tuple2) can repeat across rows. Also, the maps paired with the same key can contain duplicate map keys between them (the second part of the Tuple2).
I want to create a final Dataset[(String, Map[String, String])]. The output for the example above should be as below. Note that when the same map key occurs for a repeated Tuple2 key, the value from the last occurrence is retained and the earlier one is dropped (check B and C):
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
Please let me know if any clarification is required.
Answer 1:
Using the RDD API:
val rdd = sc.parallelize(Seq(
  ("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
  ("C", Map(7 -> 100, 8 -> 200, 5 -> 800))
))

// merge the maps per key; Map ++ keeps the right-hand (later) value on duplicate keys
rdd.reduceByKey((a, b) => a ++ b).collect()
// Array((A,Map(1 -> 100, 2 -> 200, 3 -> 100)), (B,Map(5 -> 900, 1 -> 500, 9 -> 300, 11 -> 900, 4 -> 300)), (C,Map(5 -> 800, 6 -> 100, 7 -> 100, 8 -> 200, 4 -> 200)))
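The last-wins behaviour comes from Scala's Map ++, which is right-biased: when both operands contain a key, the right operand's value is kept. A quick plain-Scala check (no Spark needed), using B's two rows from the example:

val first  = Map(1 -> 400, 4 -> 300, 5 -> 900)  // B's first row
val second = Map(1 -> 500, 9 -> 300, 11 -> 900) // B's second row
first ++ second
// Map(1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900)  -- key 1 takes the later value

Note that this relies on reduceByKey seeing the rows in their original order, which holds here but is not strictly guaranteed after a repartition.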
And using the DataFrame API:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(Seq(
  ("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
  ("C", Map(7 -> 100, 8 -> 200, 5 -> 800))
)).toDF("key", "map")

// have map_from_entries keep the last value for duplicate map keys (Spark 3.0+)
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("map", map_entries($"map"))                // map -> array of (key, value) structs
  .groupBy("key").agg(collect_list($"map").alias("map")) // collect the arrays per key
  .withColumn("map", flatten($"map"))                    // concatenate them in row order
  .withColumn("map", map_from_entries($"map"))           // back to a map; last duplicate wins
  .show(false)
+---+---------------------------------------------------+
|key|map |
+---+---------------------------------------------------+
|B |[1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900]|
|C |[6 -> 100, 4 -> 200, 5 -> 800, 7 -> 100, 8 -> 200] |
|A |[1 -> 100, 2 -> 200, 3 -> 100] |
+---+---------------------------------------------------+
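Since the question starts from a typed Dataset[(String, Map[String, String])], here is a minimal sketch of the same merge in the typed API; it assumes a SparkSession in scope as spark, and it carries the same ordering caveat as the RDD version:

import spark.implicits._

val ds = Seq(
  ("A", Map("1" -> "100", "2" -> "200", "3" -> "100")),
  ("B", Map("1" -> "400", "4" -> "300", "5" -> "900")),
  ("C", Map("6" -> "100", "4" -> "200", "5" -> "100")),
  ("B", Map("1" -> "500", "9" -> "300", "11" -> "900")),
  ("C", Map("7" -> "100", "8" -> "200", "5" -> "800"))
).toDS()

val merged = ds
  .groupByKey(_._1)          // group by the String key
  .mapGroups { (key, rows) =>
    // fold the group's maps left to right; ++ is right-biased, so later values win
    (key, rows.map(_._2).reduce(_ ++ _))
  }

merged.show(false)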
Answer 2:
Using dataframes:
val df = Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
("C", Map(7 -> 100, 8 -> 200, 5 -> 800))).toDF("a", "b")
val df2 = df.select('a, explode('b))
.groupBy("a", "key") //remove the duplicate keys
.agg(last('value).as("value")) //and take the last value for duplicate keys
.groupBy("a")
.agg(map_from_arrays(collect_list('key), collect_list('value)).as("b"))
df2.show()
prints
+---+---------------------------------------------------+
|a |b |
+---+---------------------------------------------------+
|B |[5 -> 900, 9 -> 300, 1 -> 500, 4 -> 300, 11 -> 900]|
|C |[6 -> 100, 8 -> 200, 7 -> 100, 4 -> 200, 5 -> 800] |
|A |[3 -> 100, 1 -> 100, 2 -> 200] |
+---+---------------------------------------------------+
As there are two aggregations involved, the RDD-based answer is likely to be faster. Note also that last only behaves deterministically here if the row order survives the shuffle, which Spark does not strictly guarantee.
Source: https://stackoverflow.com/questions/63647473/combine-value-part-of-tuple2-which-is-a-map-into-single-map-grouping-by-the-key