Spark / Scala: forward fill with last observation

Asked 2020-11-27 14:43 by 你的背包

Using Spark 1.4.0, Scala 2.10

I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I woul…

2 Answers
  • 2020-11-27 15:11

    Initial answer (assuming a single time series):

    First of all, try to avoid window functions if you cannot provide a PARTITION BY clause. Without one, a window function moves all data to a single partition, so most of the time it is simply not feasible.

    What you can do is fill the gaps on an RDD using mapPartitionsWithIndex. Since you didn't provide example data or an expected output, consider this pseudocode rather than a real Scala program (a hedged end-to-end sketch follows the step list below):

    • first, let's order the DataFrame by date and convert it to an RDD

      import org.apache.spark.sql.Row
      import org.apache.spark.rdd.RDD
      
      val rows: RDD[Row] = df.orderBy($"Date").rdd
      
    • next, let's find the last non-null observation per partition

      def notMissing(row: Row): Boolean = ???
      
      val toCarry: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows
        .mapPartitionsWithIndex{ case (i, iter) => 
          Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
        .collectAsMap
      
    • and convert this Map to a broadcast variable

      val toCarryBd = sc.broadcast(toCarry)
      
    • finally map over partitions once again filling the gaps:

      def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = {
        // If it is the beginning of partition and value is missing
        // extract value to fill from toCarryBd.value
        // Remember to correct for empty / only missing partitions
        // otherwise take last not-null from the current partition
      }
      
      val imputed: RDD[Row] = rows
        .mapPartitionsWithIndex{ case (i, iter) => fill(i, iter) } 
      
    • finally, convert back to a DataFrame
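
    For concreteness, here is a minimal end-to-end sketch of the two placeholders above (notMissing and fill). It assumes, purely for illustration, a nullable value column named "Value" next to "Date"; adjust the column name and the Row reconstruction to your actual schema.

    import org.apache.spark.sql.Row

    // Position of the (assumed) value column in the schema
    val valueIdx: Int = df.schema.fieldNames.indexOf("Value")

    def notMissing(row: Row): Boolean = !row.isNullAt(valueIdx)

    def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = {
      // Last non-null row carried over from earlier partitions; scanning all
      // preceding partitions also covers empty / all-missing partitions
      var carry: Option[Row] =
        (0 until i).flatMap(j => toCarryBd.value.getOrElse(j, None)).lastOption
      iter.map { row =>
        if (notMissing(row)) { carry = Some(row); row }
        else carry match {
          case Some(prev) =>
            Row.fromSeq(row.toSeq.updated(valueIdx, prev.get(valueIdx)))
          case None => row  // nothing observed yet, leave the leading null as-is
        }
      }
    }

    val imputed = rows.mapPartitionsWithIndex { case (i, iter) => fill(i, iter) }
    val dfImputed = sqlContext.createDataFrame(imputed, df.schema)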

    Edit (partitioned data / one time series per group):

    The devil is in the details. If your data is partitioned after all, then the whole problem can be solved using groupBy. Let's assume you simply partition by column "k" of type T and that Date is an integer timestamp (a hedged sketch of the fill helper follows below):

    def fill(iter: List[Row]): List[Row] = {
      // Just go row by row and fill with last non-empty value
      ???
    }
    
    val groupedAndSorted = df.rdd
      .groupBy(_.getAs[T]("k"))
      .mapValues(_.toList.sortBy(_.getAs[Int]("Date")))
    
    val rows: RDD[Row] = groupedAndSorted.mapValues(fill).values.flatMap(identity)
    
    val dfFilled = sqlContext.createDataFrame(rows, df.schema)
    

    This way you can fill all columns at the same time.
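
    For reference, a hedged sketch of this fill helper, again assuming a single nullable value column named "Value" (an illustrative name, not from the question):

    import org.apache.spark.sql.Row

    // Index of the assumed value column, computed once on the driver
    val valueIdx: Int = df.schema.fieldNames.indexOf("Value")

    def fill(iter: List[Row]): List[Row] = {
      // Walk the date-sorted rows of one group, carrying the last non-null value forward
      var last: Option[Any] = None
      iter.map { row =>
        if (!row.isNullAt(valueIdx)) { last = Some(row.get(valueIdx)); row }
        else last.map(v => Row.fromSeq(row.toSeq.updated(valueIdx, v))).getOrElse(row)
      }
    }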

    Can this be done with DataFrames instead of converting back and forth to RDD?

    It depends, although it is unlikely to be efficient. If the maximum gap is relatively small, you can do something like this:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.{WindowSpec, Window}
    import org.apache.spark.sql.Column
    
    val maxGap: Int = ???  // Maximum gap between observations
    val columnsToFill: List[String] = ???  // List of columns to fill
    val suffix: String = "_"  // To disambiguate between original and imputed columns
    
    // Window matching your data, e.g. partitioned by "k" and ordered by Date
    val w: WindowSpec = Window.partitionBy($"k").orderBy($"Date")
    
    // Take lags 1 to maxGap and coalesce
    def makeCoalesce(w: WindowSpec)(maxGap: Int)(suffix: String)(c: String): Column = {
      // Generate lag values between 1 and maxGap
      val lags = (1 to maxGap).map(i => lag(col(c), i).over(w))
      // Add the current value, coalesce and set an alias
      coalesce(col(c) +: lags: _*).alias(s"$c$suffix")
    }
    
    // For each column you want to fill, apply makeCoalesce
    val lags: List[Column] = columnsToFill.map(makeCoalesce(w)(maxGap)(suffix))
    
    // Finally select
    val dfImputed = df.select($"*" :: lags: _*)
    

    It can easily be adjusted to use a different maximum gap per column.
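
    For example, a hedged sketch with a per-column gap map; the column names and gap sizes here are hypothetical:

    // Hypothetical per-column maximum gaps
    val gaps: Map[String, Int] = Map("rate" -> 3, "volume" -> 7)
    
    val perColumn: List[Column] =
      gaps.map { case (c, g) => makeCoalesce(w)(g)(suffix)(c) }.toList
    
    val dfImputedPerColumn = df.select($"*" :: perColumn: _*)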

    A simpler way to achieve a similar result in the latest Spark version is to use last with ignoreNulls:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    
    val w = Window.partitionBy($"k").orderBy($"Date")
      .rowsBetween(Window.unboundedPreceding, -1)
    
    df.withColumn("value", coalesce($"value", last($"value", true).over(w)))
    

    While it is possible to drop the partitionBy clause and apply this method globally, it would be prohibitively expensive with large datasets.
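
    A minimal, self-contained sketch of this last approach on toy data (the column names and values are made up for illustration, and it assumes Spark 2.x with a SparkSession):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    
    val spark = SparkSession.builder.master("local[*]").appName("ffill").getOrCreate()
    import spark.implicits._
    
    // Toy data: group "a" has internal gaps, group "b" has a leading null
    val df = Seq(
      ("a", 1, Some(1.0)), ("a", 2, None), ("a", 3, None), ("a", 4, Some(4.0)),
      ("b", 1, None), ("b", 2, Some(2.0))
    ).toDF("k", "Date", "value")
    
    val w = Window.partitionBy($"k").orderBy($"Date")
      .rowsBetween(Window.unboundedPreceding, -1)
    
    // Each null is replaced by the last non-null value seen earlier in its group;
    // the leading null in group "b" stays null because nothing precedes it
    df.withColumn("value", coalesce($"value", last($"value", true).over(w)))  // true = ignoreNulls
      .orderBy($"k", $"Date")
      .show()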

  • 2020-11-27 15:16

    It is possible to do this using only window functions (without the last function) and some clever partitioning. I personally really dislike having to use the combination of a groupBy followed by a join.

    So given:

    date,      currency, rate
    20190101   JPY       NULL
    20190102   JPY       2
    20190103   JPY       NULL
    20190104   JPY       NULL
    20190102   JPY       3
    20190103   JPY       4
    20190104   JPY       NULL
    

    We can use Window.unboundedPreceding and Window.unboundedFollowing to create a key for forward and backward fill.

    The following code:

    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    df
       .select("date", "currency", "rate")
       // Equivalent of na.fill(0, Seq("rate")), but can be made more generic here.
       // You may need abs(col("rate")) if the value column can be negative, since negative
       // values would break the running sums used to build the forward and backward keys.
       .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
       .withColumn("rate_backsum",
         sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
       .withColumn("rate_forwardsum",
         sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))
    

    gives:

    date,      currency, rate,  rate_filled, rate_backsum, rate_forwardsum
    20190101   JPY       NULL             0             0             9
    20190102   JPY       2                2             2             9
    20190103   JPY       NULL             0             2             7
    20190104   JPY       NULL             0             2             7
    20190102   JPY       3                3             5             7
    20190103   JPY       4                4             9             4
    20190104   JPY       NULL             0             9             0
    

    Therefore, we've built two keys (x_backsum and x_forwardsum) that can be used to forward fill and backfill. With the following two Spark lines:

    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")
    
       ...
       .withColumn("rate_backfilled", avg("rate").over(wb))
       .withColumn("rate_forwardfilled", avg("rate").over(wf))
    

    Finally:

    date,      currency, rate,   rate_backsum, rate_forwardsum, rate_ffilled
    20190101   JPY       NULL               0               9              2
    20190102   JPY       2                  2               9              2
    20190103   JPY       NULL               2               7              3
    20190104   JPY       NULL               2               7              3
    20190102   JPY       3                  5               7              3
    20190103   JPY       4                  9               4              4
    20190104   JPY       NULL               9               0              0
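
    Putting the pieces together, a consolidated sketch of the whole pipeline described above, using the same column names as in the answer:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    
    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")
    
    val dfFilled = df
      .select("date", "currency", "rate")
      // Build the forward / backward keys from running sums of the zero-filled rate
      .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
      .withColumn("rate_backsum",
        sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .withColumn("rate_forwardsum",
        sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))
      // Average the raw rate within each key group to materialise the fills
      .withColumn("rate_backfilled", avg("rate").over(wb))
      .withColumn("rate_forwardfilled", avg("rate").over(wf))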
    