Question
My question heading might not be accurate, but I hope I can explain the problem. I have a data frame like the one below:
DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|
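(For reproducibility: below is a minimal sketch of how a test data frame matching two of the rows above could be built, assuming a SparkSession named spark as in spark-shell; my real df2result is read from |^|-delimited source files, so this is only an illustration.)
// Hypothetical reproduction of two of the rows above; the real df2result
// comes from |^|-delimited files, with the trailing |!| kept in FFAction_1.
import spark.implicits._
val df2result = Seq(
  ("SelfSourcedPublic", "2016", "1510044629601", "4295858941", "35", "36", "1", "I|!|"),
  ("SelfSourcedPublic", "2016", "1510044629625", "4295858941", null, "36", null, "D|!|")
).toDF("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId",
  "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")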
So the primary key for my data frame is "OrganizationId", "AnnualPeriodId", "InterimPeriodId".
Below is my code to get the latest record for each primary key, ordered by time stamp:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._   // rank()
import org.apache.spark.sql.types.LongType
import spark.implicits._                   // enables the $"column" syntax
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
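(Side note, not part of my original code: rank() keeps every row that ties for the latest TimeStamp within a key, so a row_number() variant over the same window, sketched below, would guarantee exactly one row per key.)
// Hypothetical variant: row_number() always yields exactly one row per
// window partition, even if two rows share the same TimeStamp.
import org.apache.spark.sql.functions.row_number
val latestOnePerKey = df2result
  .withColumn("rn", row_number().over(windowSpec))
  .filter($"rn" === 1)
  .drop("rn", "TimeStamp")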
Now my issue is that I sometimes get null in some of the primary-key columns, for example the record with time stamp 1510044629624.
My requirement: the two records below have the same primary key, except that the second one has nulls in some of the key columns. In this case I still need only one record, the one with the latest time stamp:
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
I should get only:
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
My current code outputs both records because of the null.
I hope my question is clear.
Answer 1:
From what I understand of your question, the problem is that you are using an extra column in the primary key. The AnnualPeriodId column can be null, and since you use that field in partitionBy, the null value forms a separate group and therefore a separate row:
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
So the solution is to remove it from the partitionBy, so that the above line becomes:
val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
I hope this solves the issue you are having.
Source: https://stackoverflow.com/questions/47269172/get-latest-records-in-a-data-frame-based-on-time-stamp-with-condition