Question
I have a table like this:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-01| 4.21496454E7|
|2018-01-04| 1.21496454E7|
|2018-01-07| 4.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-01| 1.21496454E7|
|2019-01-04| 2.21496454E7|
|2019-01-07| 3.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-01| 5.21496454E7|
|2020-01-04| 4.21496454E7|
|2020-01-07| 6.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-01| 2.21496454E7|
|2021-01-04| 1.21496454E7|
|2021-01-07| 2.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-01| 4.21496454E7|
|2022-01-04| 5.21496454E7|
|2022-01-07|2.209869511E7|
|2022-01-10|3.209869511E7|
+----------+-------------+
Is there a way to filter this DataFrame so that I get something like this:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017-01-10| 2.21496454E7|
|2018-01-10| 5.21496454E7|
|2019-01-10| 1.21496454E7|
|2020-01-10| 3.21496454E7|
|2021-01-10| 3.21496454E7|
|2022-01-10|3.209869511E7|
+----------+-------------+
I.e., get the latest date from each year and the corresponding BALANCE_DRAWN value.
I managed to get a partial result with the following code:
from pyspark.sql import functions as f

df = df.groupBy(f.year("date")).agg(f.last("BALANCE_DRAWN"))

But the output only contains the year, not the full date:
+----------+-------------+
| date|BALANCE_DRAWN|
+----------+-------------+
|2017 | 2.21496454E7|
|2018 | 5.21496454E7|
|2019 | 1.21496454E7|
|2020 | 3.21496454E7|
|2021 | 3.21496454E7|
|2022 |3.209869511E7|
+----------+-------------+
The values are right, but I need something more flexible (not just grouped by year).
UPDATE: Maybe max() can be used in some way. (Trying it, will update)
UPDATE 2: Accepted answer did it!
Answer 1:
from pyspark.sql import functions as f

# Add a year column to group on, then keep the max date per year.
df = (df.withColumn('year', f.year('date'))
        .groupBy('year')
        .agg(f.max('date').alias('date'),
             f.first('BALANCE_DRAWN').alias('BALANCE_DRAWN')))
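One caveat worth noting: f.first('BALANCE_DRAWN') inside agg is not guaranteed to return the balance from the max-date row; without an explicit ordering, first/last depend on how Spark happens to arrange the rows within each group. Below is a sketch of a variant that ties the balance to the latest date deterministically, using the max()-on-a-struct idea the question's first UPDATE hints at (column names are as in the question; the 'latest' alias is mine):

from pyspark.sql import functions as f

# Pack (date, BALANCE_DRAWN) into a struct; max() compares structs
# field by field, so the struct with the latest date wins and carries
# its balance along with it.
result = (df.groupBy(f.year('date').alias('year'))
            .agg(f.max(f.struct('date', 'BALANCE_DRAWN')).alias('latest'))
            .select('latest.date', 'latest.BALANCE_DRAWN'))

Swapping f.year for f.month or any other grouping expression gives the "more flexible" behavior the question asks for.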
Source: https://stackoverflow.com/questions/58853922/getting-latest-dates-from-each-year-in-a-pyspark-date-column