One method is to use orderBy (ascending by default), then groupBy with the first aggregation:
import org.apache.spark.sql.functions.first

// Sort by level first, then take the first level seen in each group
df.orderBy("level")
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
You can make the sort order explicit with .asc for ascending or .desc for descending, as below:
df.orderBy($"level".asc)
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
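For instance, flipping to descending keeps the highest level per group instead (same hypothetical data as above):

// Descending sort, so first picks the largest level in each group
df.orderBy($"level".desc)
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
// Expected result: (1, 10, 3) and (2, 20, 5)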
You can also do the same operation with a window and the row_number function, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (item_id, country_id) group by ascending level,
// then keep only the top-ranked row per group
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)

df.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .show()
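One design note: the groupBy version only returns the grouped and aggregated columns, while the window version keeps every column of the winning row, and row_number guarantees exactly one row per group. A quick sketch with a hypothetical extra price column:

// Hypothetical data with an extra column beyond the question's schema
val dfWithPrice = Seq(
  (1, 10, 3, 9.99),
  (1, 10, 1, 4.99)
).toDF("item_id", "country_id", "level", "price")

// windowSpec from above partitions by (item_id, country_id) and orders by
// level, so the level-1 row wins and its price (4.99) is preserved
dfWithPrice.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .show()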