Getting latest dates from each year in a PySpark date column

Submitted by 独自空忆成欢 on 2019-12-12 20:24:02

Question


I have a table like this:

+----------+-------------+
|      date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-01| 4.21496454E7|
|2018-01-04| 1.21496454E7|
|2018-01-07| 4.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-01| 1.21496454E7|
|2019-01-04| 2.21496454E7|
|2019-01-07| 3.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-01| 5.21496454E7|
|2020-01-04| 4.21496454E7|
|2020-01-07| 6.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-01| 2.21496454E7|
|2021-01-04| 1.21496454E7|
|2021-01-07| 2.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-01| 4.21496454E7|
|2022-01-04| 5.21496454E7|
|2022-01-07|2.209869511E7|
|2022-01-10|3.209869511E7|
+----------+-------------+

Is there a way to filter this dataframe, so I get something like this:

+----------+-------------+
|      date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-10|3.209869511E7|
+----------+-------------+

I.e. get the latest date from each year and the corresponding BALANCE_DRAWN row.

I managed to get this result, but only for the yearly case, with the following code:

df = df.groupby([f.year("date")]).agg(f.last("BALANCE_DRAWN"))

But the output contains only the year, not the full date:

+----------+-------------+
|      date|BALANCE_DRAWN|
+----------+-------------+
|2017      | 2.21496454E7|
|2018      | 5.21496454E7|
|2019      | 1.21496454E7|
|2020      | 3.21496454E7|
|2021      | 3.21496454E7|
|2022      |3.209869511E7|
+----------+-------------+

The result is correct, but I need something more flexible (not just grouping by year).

UPDATE: Maybe max() can be used in some way. (Trying it, will update)

UPDATE 2: Accepted answer did it!


Answer 1:


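from pyspark.sql.functions import year, max as max_, first

# Group by calendar year and keep the latest date in each group.
# Caveat: first() is not guaranteed to return the balance from the row that holds the max date.
df = (df.groupBy(year('date').alias('year'))
        .agg(max_('date').alias('date'), first('BALANCE_DRAWN').alias('BALANCE_DRAWN')))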


Source: https://stackoverflow.com/questions/58853922/getting-latest-dates-from-each-year-in-a-pyspark-date-column
