One method is to use orderBy (ascending by default), then groupBy with the first aggregation:
import org.apache.spark.sql.functions.first

// Sort by level first, then take the first level seen in each group
df.orderBy("level")
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
You can make the sort order explicit with .asc for ascending or .desc for descending, as below:
df.orderBy($"level".asc)
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
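For instance, flipping to descending keeps the highest level per group instead (same hypothetical data as above):

// Descending sort, so first picks the largest level in each group
df.orderBy($"level".desc)
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
// Expected result: (1, 10, 3) and (2, 20, 5)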
You can also do the same operation with a window and the row_number function, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (item_id, country_id) group by ascending level,
// then keep only the top-ranked row per group
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)

df.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .show()
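One design note: the groupBy version only returns the grouped and aggregated columns, while the window version keeps every column of the winning row, and row_number guarantees exactly one row per group. A quick sketch with a hypothetical extra price column:

// Hypothetical data with an extra column beyond the question's schema
val dfWithPrice = Seq(
  (1, 10, 3, 9.99),
  (1, 10, 1, 4.99)
).toDF("item_id", "country_id", "level", "price")

// windowSpec from above partitions by (item_id, country_id) and orders by
// level, so the level-1 row wins and its price (4.99) is preserved
dfWithPrice.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .show()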