I am using a script for a CDC merge in Spark streaming. I want to pass the columns used in selectExpr through a parameter, because the column names change from table to table. When I pass the c
Something like this should work:
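import org.apache.spark.sql.{DataFrame, functions => sqlfun}

// keyCols: grouping columns; structCols: expressions that must yield a struct aliased as "otherCols"
def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    // selectExpr takes varargs, so expand the Seq with `: _*`
    .selectExpr((keyCols ++ structCols): _*)
    // max over a struct compares its fields in order, so the first field decides the winner
    .groupBy(keyCols.head, keyCols.tail: _*).agg(sqlfun.max("otherCols").as("latest"))
    // flatten the winning struct back into top-level columns
    .selectExpr((keyCols :+ "latest.*"): _*)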
You can then call it like this:
foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))