Question
I have a table like this:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-01| 4.21496454E7|
|2018-01-04| 1.21496454E7|
|2018-01-07| 4.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-01| 1.21496454E7|
|2019-01-04| 2.21496454E7|
|2019-01-07| 3.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-01| 5.21496454E7|
|2020-01-04| 4.21496454E7|
|2020-01-07| 6.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-01| 2.21496454E7|
|2021-01-04| 1.21496454E7|
|2021-01-07| 2.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-01| 4.21496454E7|
|2022-01-04| 5.21496454E7|
|2022-01-07|2.209869511E7|
|2022-01-10|3.209869511E7|
+----------+-------------+
Is there a way to filter this DataFrame so that I get something like this:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-10|3.209869511E7|
+----------+-------------+
I.e., get the latest date from each year and the corresponding BALANCE_DRAWN value.
I managed to get a partial result with the following code:
from pyspark.sql import functions as f

df = df.groupBy(f.year("date")).agg(f.last("BALANCE_DRAWN"))

But the output only contains the year, not the full date:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017 | 2.21496454E7|
|2018 | 5.21496454E7|
|2019 | 1.21496454E7|
|2020 | 3.21496454E7|
|2021 | 3.21496454E7|
|2022 |3.209869511E7|
+----------+-------------+
The values are right, but I need something more flexible (not just grouped by year).
UPDATE: Maybe max() can be used in some way. (Trying it, will update)
UPDATE 2: Accepted answer did it!
Answer 1:
from pyspark.sql import functions as f

# Add a year column to group on, then keep the max date per year.
df = (df.withColumn('year', f.year('date'))
        .groupBy('year')
        .agg(f.max('date').alias('date'),
             f.first('BALANCE_DRAWN').alias('BALANCE_DRAWN')))
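One caveat worth noting: f.first('BALANCE_DRAWN') inside agg is not guaranteed to return the balance from the max-date row; without an explicit ordering, first/last depend on how Spark happens to arrange the rows within each group. Below is a sketch of a variant that ties the balance to the latest date deterministically, using the max()-on-a-struct idea the question's first UPDATE hints at (column names are as in the question; the 'latest' alias is mine):

from pyspark.sql import functions as f

# Pack (date, BALANCE_DRAWN) into a struct; max() compares structs
# field by field, so the struct with the latest date wins and carries
# its balance along with it.
result = (df.groupBy(f.year('date').alias('year'))
            .agg(f.max(f.struct('date', 'BALANCE_DRAWN')).alias('latest'))
            .select('latest.date', 'latest.BALANCE_DRAWN'))

Swapping f.year for f.month or any other grouping expression gives the "more flexible" behavior the question asks for.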
Source: https://stackoverflow.com/questions/58853922/getting-latest-dates-from-each-year-in-a-pyspark-date-column