Get latest records in a data frame based on time stamp with condition

Submitted by 左心房为你撑大大i on 2020-01-07 04:55:27

Question


My question title might not be accurate, but I hope I can explain the problem. I have a data frame like the one below:

DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|

The primary key for my data frame is

"OrganizationId", "AnnualPeriodId", "InterimPeriodId"

Below is my code to get the latest record for each primary key, based on the time stamp:

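import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.types.LongType
// Rank rows within each primary key, newest TimeStamp first, then keep rank 1
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")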

My issue is that sometimes I get null in one of the primary key columns, as in the record with time stamp 1510044629624.

The two records below have the same primary key, except that one of them has null in AnnualPeriodId. In this case I still need only one record, the one with the latest time stamp:

SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

I should get SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

My current code returns both records because of the null.
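The behaviour described above can be reproduced without Spark. The following is only a minimal sketch using plain Scala collections as a stand-in for the window logic; the `Rec` case class, the `latestBy` helper and the use of `None` to model the null value are assumptions made for illustration and are not part of the original code.

```scala
object NullKeyDemo {
  // Hypothetical, simplified row: only the columns that matter for the key.
  final case class Rec(
    timeStamp: Long,
    organizationId: Long,
    annualPeriodId: Option[Int], // None stands in for the null value
    interimPeriodId: Int,
    ffAction: String
  )

  // The two records from the question.
  val records: Seq[Rec] = Seq(
    Rec(1510044629601L, 4295858941L, Some(35), 36, "I"),
    Rec(1510044629625L, 4295858941L, None, 36, "D")
  )

  // Stand-in for partitionBy(key), newest TimeStamp first, keep the top row.
  def latestBy[K](rows: Seq[Rec])(key: Rec => K): Seq[Rec] =
    rows.groupBy(key).values.map(_.maxBy(_.timeStamp)).toSeq

  // Three-column key: Some(35) and None are different key values,
  // so each record ends up alone in its own group and both survive.
  val survivors: Seq[Rec] =
    latestBy(records)(r => (r.organizationId, r.annualPeriodId, r.interimPeriodId))

  def main(args: Array[String]): Unit =
    println(survivors.size) // prints 2
}
```

Each record is the newest row of its own group, so filtering on rank 1 cannot remove either of them.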

I hope my question is clear.


Answer 1:


From what I understand of your question, the problem is that you are using an extra column as part of the primary key.

The AnnualPeriodId column is sometimes null, and since you use that field in partitionBy, the null value forms a separate group and therefore produces a separate row:

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

So the solution is to remove it from the partitionBy so that the above line becomes

val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

I hope this solves the issue you are having.
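To try the change in isolation, here is a minimal sketch that assumes Spark 2.x or later on the classpath and a local session. The object name `LatestPerKey`, the method name, and the two-row sample DataFrame are illustrative assumptions; the null AnnualPeriodId is modelled as a real null via `Option.empty`, whereas in the original file it may be the literal string "null".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
import org.apache.spark.sql.types.LongType

object LatestPerKey {
  // Keep the newest row per (OrganizationId, InterimPeriodId). AnnualPeriodId is
  // left out of the partition so a null there cannot split a group.
  def latestForEachKey(df: DataFrame): DataFrame = {
    val windowSpec = Window
      .partitionBy("OrganizationId", "InterimPeriodId")
      .orderBy(col("TimeStamp").cast(LongType).desc)
    df.withColumn("rank", rank().over(windowSpec))
      .filter(col("rank") === 1)
      .drop("rank", "TimeStamp")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("latest-per-key")
      .getOrCreate()
    import spark.implicits._

    // Two rows from the question, reduced to the relevant columns.
    val df = Seq(
      ("1510044629601", "4295858941", Option("35"), "36", "I"),
      ("1510044629625", "4295858941", Option.empty[String], "36", "D")
    ).toDF("TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "FFAction_1")

    // Only the row with FFAction_1 = D (the later TimeStamp) should remain.
    latestForEachKey(df).show(false)
    spark.stop()
  }
}
```

One thing to check before adopting this: rows that share OrganizationId and InterimPeriodId but have different non-null AnnualPeriodId values also fall into one group. In the sample data, the three rows with InterimPeriodId 5 (AnnualPeriodId 19, 20 and 21) would be reduced to a single row, so whether the two-column key is acceptable depends on the data.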



Source: https://stackoverflow.com/questions/47269172/get-latest-records-in-a-data-frame-based-on-time-stamp-with-condition
