Spark get top N highest score results for each (item1, item2, score)

Submitted by 一世执手 on 2020-01-24 00:34:04

Question


I have a DataFrame of the following format:

item_id1: Long, item_id2: Long, similarity_score: Double

What I'm trying to do is get the top N highest similarity_score records for each item_id1. So, for example:

1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3

Taking the top 2 similar items for each item_id1 would give:

1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4

I vaguely guess that it can be done by first grouping the records by item_id1, then sorting each group by score in descending order, and then limiting the results. But I'm stuck on how to implement it in Spark Scala.
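The group-sort-limit idea from the question can be sketched with plain Scala collections (no Spark) as a reference for what the result should look like; the `Sim` case class and `topN` helper here are hypothetical names, not part of any Spark API:

```scala
// Plain-Scala sketch of "top N per key": group by item_id1,
// sort each group by score descending, keep the top n rows.
case class Sim(itemId1: Long, itemId2: Long, score: Double)

def topN(rows: Seq[Sim], n: Int): Seq[Sim] =
  rows
    .groupBy(_.itemId1)                  // group records by item_id1
    .values
    .flatMap(_.sortBy(-_.score).take(n)) // sort each group descending, keep top n
    .toSeq
    .sortBy(r => (r.itemId1, -r.score))  // stable output order for display

val data = Seq(
  Sim(1, 2, 0.5), Sim(1, 3, 0.4), Sim(1, 4, 0.3),
  Sim(2, 1, 0.5), Sim(2, 3, 0.4), Sim(2, 4, 0.3)
)

topN(data, 2).foreach(println)
// Sim(1,2,0.5)  Sim(1,3,0.4)  Sim(2,1,0.5)  Sim(2,3,0.4)
```

On a DataFrame the same per-group ranking is what a window function computes, as the answer below shows.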

Thank you.


Answer 1:


I would suggest using window functions for this. Note that the ordering must be descending to get the *highest* scores, and the filter must reference the same column name that `withColumn` created:

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.row_number

 df
  .withColumn("rnk", row_number().over(
    Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)

Alternatively, you could use dense_rank/rank instead of row_number, depending on how you want to handle ties in similarity_score.
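The difference between the ranking functions can be illustrated with plain Scala over an assumed list of scores (this mimics the SQL semantics; it is not Spark code): row_number assigns a strictly increasing number even to ties, while rank gives ties the same value and skips the following rank.

```scala
// Scores already sorted descending, with a tie at 0.5.
val scores = Seq(0.5, 0.5, 0.4)

// row_number-style: 1, 2, 3 (ties broken arbitrarily)
val rowNumbers = scores.zipWithIndex.map { case (s, i) => (s, i + 1) }

// rank-style: 1, 1, 3 (ties share a rank, next rank is skipped)
val ranks = scores.map(s => (s, scores.indexWhere(_ == s) + 1))

println(rowNumbers) // List((0.5,1), (0.5,2), (0.4,3))
println(ranks)      // List((0.5,1), (0.5,1), (0.4,3))
```

So with `row_number` and N = 2, a tie at the cutoff keeps exactly 2 rows per item; with `rank` both tied rows would be kept.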



Source: https://stackoverflow.com/questions/49200522/spark-get-top-n-highest-score-results-for-each-item1-item2-score
