Combine the value part of a Tuple2 (a map) into a single map, grouping by the key of the Tuple2

Submitted on 2021-01-28 05:45:13

Question


I am doing this in Scala and Spark.

I have a Dataset of Tuple2, i.e. Dataset[(String, Map[String, String])].

Below is an example of the values in the Dataset.

(A, {1->100, 2->200, 3->100})
(B, {1->400, 4->300, 5->900})
(C, {6->100, 4->200, 5->100})
(B, {1->500, 9->300, 11->900})
(C, {7->100, 8->200, 5->800})

If you notice, the key (the first element of the Tuple2) can be repeated. Also, the maps belonging to the same tuple key can contain duplicate keys among themselves (in the second part of the Tuple2).

I want to create a final Dataset[(String, Map[String, String])]. The output for the example above is shown below. Note that when a map key is duplicated, the value from the last occurrence is retained (check B and C), and the earlier value for that same key under B and C is dropped.

(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
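For reference, the intended merge semantics can be sketched in plain Scala on an in-memory collection (no Spark; the Int keys/values mirror the sample data used in the answers below). Merging maps left-to-right with `++` makes the later map's value win on duplicate keys:

```scala
val data = Seq(
  ("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
  ("C", Map(7 -> 100, 8 -> 200, 5 -> 800)))

// Group by the tuple key and merge each group's maps left-to-right;
// `++` is right-biased, so later values overwrite earlier ones.
val merged: Map[String, Map[Int, Int]] =
  data.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2).reduce(_ ++ _)
  }
```

`groupBy` on a `Seq` keeps elements in encounter order within each group, so the reduction sees the maps in their original order.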

Please let me know if any clarification is required.


Answer 1:


Using an RDD:

val rdd = sc.parallelize(
    Seq(("A", Map(1->100, 2->200, 3->100)),
        ("B", Map(1->400, 4->300, 5->900)),
        ("C", Map(6->100, 4->200, 5->100)),
        ("B", Map(1->500, 9->300, 11->900)),
        ("C", Map(7->100, 8->200, 5->800)))
)

rdd.reduceByKey((a, b) => a ++ b).collect()

// Array((A,Map(1 -> 100, 2 -> 200, 3 -> 100)), (B,Map(5 -> 900, 1 -> 500, 9 -> 300, 11 -> 900, 4 -> 300)), (C,Map(5 -> 800, 6 -> 100, 7 -> 100, 8 -> 200, 4 -> 200)))
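This works because `++` on Scala maps is right-biased: when both operands contain a key, the right operand's value is kept. A minimal standalone illustration with the B maps from the example:

```scala
val left  = Map(1 -> 400, 4 -> 300, 5 -> 900)
val right = Map(1 -> 500, 9 -> 300, 11 -> 900)

// Key 1 appears in both maps; `++` keeps the value from `right`.
val combined = left ++ right
// combined: Map(1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900)
```

So `reduceByKey(_ ++ _)` keeps the value from the later map for each duplicate key, matching the desired output.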

And using a DataFrame:

val df = spark.createDataFrame(
    Seq(("A", Map(1->100, 2->200, 3->100)),
        ("B", Map(1->400, 4->300, 5->900)),
        ("C", Map(6->100, 4->200, 5->100)),
        ("B", Map(1->500, 9->300, 11->900)),
        ("C", Map(7->100, 8->200, 5->800)))
).toDF("key", "map")

// LAST_WIN requires Spark 3.0+; with the default policy (EXCEPTION),
// map_from_entries fails on duplicate keys
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("map", map_entries($"map"))
  .groupBy("key").agg(collect_list($"map").alias("map"))
  .withColumn("map", flatten($"map"))
  .withColumn("map", map_from_entries($"map")).show(false)

+---+---------------------------------------------------+
|key|map                                                |
+---+---------------------------------------------------+
|B  |[1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900]|
|C  |[6 -> 100, 4 -> 200, 5 -> 800, 7 -> 100, 8 -> 200] |
|A  |[1 -> 100, 2 -> 200, 3 -> 100]                     |
+---+---------------------------------------------------+



Answer 2:


Using dataframes:

val df = Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
    ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
    ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
    ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
    ("C", Map(7 -> 100, 8 -> 200, 5 -> 800))).toDF("a", "b")

val df2 = df.select('a, explode('b))
    .groupBy("a", "key")           //remove the duplicate keys
    .agg(last('value).as("value")) //and take the last value for duplicate keys
    .groupBy("a")
    .agg(map_from_arrays(collect_list('key), collect_list('value)).as("b"))
df2.show()

prints

+---+---------------------------------------------------+
|a  |b                                                  |
+---+---------------------------------------------------+
|B  |[5 -> 900, 9 -> 300, 1 -> 500, 4 -> 300, 11 -> 900]|
|C  |[6 -> 100, 8 -> 200, 7 -> 100, 4 -> 200, 5 -> 800] |
|A  |[3 -> 100, 1 -> 100, 2 -> 200]                     |
+---+---------------------------------------------------+
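The strategy here can also be sketched in plain Scala: flatten each (group, map) pair into (group, key, value) rows (mirroring `explode('b)`), then rebuild one map per group, where `toMap` keeps the last occurrence of each key (mirroring `agg(last('value))`). Only the two B rows are used, to keep the sketch short:

```scala
// Flatten each (group, map) pair into (group, key, value) rows.
val rows = Seq(
  ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
  ("B", Map(1 -> 500, 9 -> 300, 11 -> 900))
).flatMap { case (g, m) => m.map { case (k, v) => (g, k, v) } }

// Rebuild one map per group; toMap keeps the last value
// seen for each key, so later rows win on duplicates.
val rebuilt: Map[String, Map[Int, Int]] =
  rows.groupBy(_._1).map { case (g, rs) =>
    g -> rs.map(r => (r._2, r._3)).toMap
  }
```

Unlike the local sketch, the DataFrame version gives no ordering guarantee to `last` within a group unless the data is sorted, so the "last occurrence wins" behavior depends on row order being preserved.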

Since this approach involves two aggregations, the RDD-based answer is likely to be faster.



Source: https://stackoverflow.com/questions/63647473/combine-value-part-of-tuple2-which-is-a-map-into-single-map-grouping-by-the-key
