Drop consecutive duplicates in a PySpark DataFrame

醉酒成梦 2021-01-19 05:03

I have a dataframe like:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  3|2.0|
## |  3|1.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+


        
1 Answer
  • 2021-01-19 05:33

    This should work as you want, though there may be room for some optimization:

    from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
    from pyspark.sql.window import Window as W

    test_df = spark.createDataFrame([
        (2,3.0),(3,6.0),(3,2.0),(3,1.0),(2,9.0),(4,7.0)
        ], ("id", "num"))
    # create a temporary ID because the window needs an ordered structure
    test_df = test_df.withColumn("idx", monotonically_increasing_id())
    w = W.orderBy("idx")
    # True unless the previous row contains the same id (the first row has no previous row, so it is kept)
    get_last = when(lag("id", 1).over(w) == col("id"), False).otherwise(True)

    # only select the rows with a changed ID
    test_df.withColumn("changed", get_last).filter(col("changed")).select("id", "num").show()
    

    Output:

    +---+---+
    | id|num|
    +---+---+
    |  2|3.0|
    |  3|6.0|
    |  2|9.0|
    |  4|7.0|
    +---+---+
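
As a quick cross-check of the output above, the rule the window expression applies (keep a row only when its id differs from the id of the row before it) can be sketched in plain Python on the same sample rows. This is only an illustration on a local list, not a replacement for the Spark code, and `drop_consecutive_duplicates` is a hypothetical helper name, not part of PySpark.

```python
# Same sample rows as test_df above: (id, num)
rows = [(2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)]


def drop_consecutive_duplicates(rows):
    """Keep the first row of every run of consecutive rows sharing an id."""
    kept = []
    previous_id = object()  # sentinel: never equal to a real id, so the first row is kept
    for row in rows:
        if row[0] != previous_id:  # plays the role of lag("id", 1) != col("id")
            kept.append(row)
        previous_id = row[0]
    return kept


print(drop_consecutive_duplicates(rows))
# [(2, 3.0), (3, 6.0), (2, 9.0), (4, 7.0)]
```

The id 2 appears twice in the result because the two occurrences are not adjacent; only runs of the same id are collapsed, which matches the table above.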
    