Spark add new column to dataframe with value from previous row

星月不相逢 2020-11-29 04:26

I'm wondering how I can achieve the following in Spark (PySpark):

Initial Dataframe:

+--+---+
|id|num|
+--+---+
|4 |9.0|
+--+---+
|3 |7.0|
+--+---+
|         


        
2 Answers
  • 2020-11-29 04:56

    You can use the lag window function as follows:

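    from pyspark.sql.functions import lag, col
    from pyspark.sql.window import Window

    # sc is an existing SparkContext (as provided by the pyspark shell)
    df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
    # No partitionBy: one global window ordered by id
    w = Window.orderBy(col("id"))
    # lag("num") is the previous row's num; na.drop() removes the first row, whose lag is null
    df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()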
    
    ## +---+---+-------+
    ## | id|num|new_col|
    ## +---+---+-------+
    ## |  2|3.0|    5.0|
    ## |  3|7.0|    3.0|
    ## |  4|9.0|    7.0|
    ## +---+---+-------+
    

    However, there are some important issues:

    1. If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient, because a window without partitionBy moves all the data into a single partition.
    2. You need a natural way to order your data.

    While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute the lag manually. See for example:

    • How to transform data with sliding window over time series data in Pyspark
    • Apache Spark Moving Average (written in Scala, but can be adjusted for PySpark. Be sure to read the comments first).

    Other useful links:

    • https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/05_moving-average-imputation.ipynb
    • Spark Window Functions - rangeBetween dates
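
A rough sketch of what "compute lag manually" on an RDD could look like, in case it helps. It assumes the RDD holds (id, num) tuples and has already been sorted by id, so the partitions are globally ordered. The helper names (last_per_partition, make_lagger, lag_rdd) are made up for this illustration and are not part of Spark; only mapPartitionsWithIndex and collect are real RDD methods. This is one possible approach, not necessarily what the linked answers do.

```python
def last_per_partition(index, rows):
    """Yield (partition index, last num) for a non-empty partition."""
    last, seen = None, False
    for row in rows:
        last, seen = row, True
    if seen:
        yield index, last[1]


def make_lagger(boundaries):
    """boundaries maps a partition index to the last num in that partition."""
    def lag_partition(index, rows):
        # The first row of a partition takes its lag from the last row of
        # the nearest earlier non-empty partition (None if there is none).
        earlier = [i for i in boundaries if i < index]
        prev = boundaries[max(earlier)] if earlier else None
        for id_, num in rows:
            yield id_, num, prev
            prev = num
    return lag_partition


def lag_rdd(sorted_rdd):
    """sorted_rdd: an RDD of (id, num) tuples already sorted by id (assumed)."""
    # Only one value per partition is collected to the driver.
    boundaries = dict(
        sorted_rdd.mapPartitionsWithIndex(last_per_partition).collect()
    )
    return sorted_rdd.mapPartitionsWithIndex(make_lagger(boundaries))
```

With the DataFrame from the answer above, this would be called as something like lag_rdd(df.rdd.map(tuple).sortBy(lambda r: r[0])). The data stays spread over its partitions; only the last value of each partition goes through the driver. The first row gets None as its lag, like the null that lag produces.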
  • 2020-11-29 05:03
    val df = sc.parallelize(Seq((4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0))).toDF("id", "num")
    df.show
    +---+---+
    | id|num|
    +---+---+
    |  4|9.0|
    |  3|7.0|
    |  2|3.0|
    |  1|5.0|
    +---+---+
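    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    // w is not defined in the original snippet; a global window ordered by id is assumed
    val w = Window.orderBy("id")
    df.withColumn("new_column", lag("num", 1, 0).over(w)).show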
    +---+---+----------+
    | id|num|new_column|
    +---+---+----------+
    |  1|5.0|       0.0|
    |  2|3.0|       5.0|
    |  3|7.0|       3.0|
    |  4|9.0|       7.0|
    +---+---+----------+
    