I have a DataFrame whose records are identified by a key, but the same key can appear in more than one record. My goal is to merge all the records that share a key.
If you know there is only one non-null record per group (or you don't care which one you get), you can use first:
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._  // needed for toDF outside spark-shell

val df = Seq(
  ("a", Some(1), None, None),
  ("a", None, Some(2), None),
  ("a", None, None, Some(3))
).toDF("key", "value1", "value2", "value3")

// first(_, true) ignores nulls, so each column gets its first non-null value per group
df.groupBy("key").agg(
  first("value1", true) as "value1",
  first("value2", true) as "value2",
  first("value3", true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
or last:
df.groupBy("key").agg(
last("value1", true) as "value1",
last("value2", true) as "value2",
last("value3", true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
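If there are many value columns, the same aggregation can be built programmatically instead of listing each column by hand. A minimal sketch, assuming the same df as above and that "key" is the only grouping column:

import org.apache.spark.sql.functions.first

// Build one first(..., ignoreNulls = true) expression per non-key column
val aggExprs = df.columns.filterNot(_ == "key").map(c => first(c, true) as c)

// Produces the same result as the explicit version above
df.groupBy("key").agg(aggExprs.head, aggExprs.tail: _*).show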