I'm looking for a way to reproduce the SAS Proc Freq code in PySpark. I found this code that does exactly what I need. However, it is written in pandas. I want to make sure the solution takes full advantage of what Spark offers, as the code will run on massive datasets. In this other post (which was also adapted for this StackOverflow answer), I also found instructions for computing distributed groupwise cumulative sums in PySpark, but I'm not sure how to adapt them to my case.
Here's an input and output example (my original dataset will have a couple of billion rows):
Input dataset:
0 Delaware
1 Delaware
2 Delaware
3 Indiana
4 Indiana
... ...
1020 West Virginia
1021 West Virginia
1022 West Virginia
1023 West Virginia
1024 West Virginia
1025 rows × 1 columns
Expected output:
state Frequency Percent Cumulative Frequency Cumulative Percent
0 Vermont 246 24.00 246 24.00
1 New Hampshire 237 23.12 483 47.12
2 Missouri 115 11.22 598 58.34
3 North Carolina 100 9.76 698 68.10
4 Indiana 92 8.98 790 77.07
5 Montana 56 5.46 846 82.54
6 West Virginia 55 5.37 901 87.90
7 North Dakota 53 5.17 954 93.07
8 Washington 39 3.80 993 96.88
9 Utah 29 2.83 1022 99.71
10 Delaware 3 0.29 1025 100.00
You can first group by state to get the frequency and percent, then use sum
over a window to get the cumulative frequency and percent:
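import pyspark.sql.functions as F
result = df.groupBy('state').agg(F.count('state').alias('Frequency')).selectExpr(
    '*', '100 * Frequency / sum(Frequency) over() Percent'  # share of all rows
).selectExpr(
    '*',  # running totals, largest group first
    'sum(Frequency) over(order by Frequency desc) Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) Cumulative_Percent')
result.show()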
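+-------------+---------+-------+--------------------+------------------+
|        state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia|        5|   50.0|                   5|              50.0|
|     Delaware|        3|   30.0|                   8|              80.0|
|      Indiana|        2|   20.0|                  10|             100.0|
+-------------+---------+-------+--------------------+------------------+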
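To sanity-check the numbers above without a Spark session, here is a minimal plain-Python sketch of the same arithmetic. It is only a cross-check for small samples, not a replacement for the Spark code. The helper name `freq_table` is hypothetical, and the input is assumed to be the ten rows visible in the question's sample (3 Delaware, 2 Indiana, 5 West Virginia).

```python
from collections import Counter
from itertools import accumulate

# Assumption: the ten sample rows visible in the question's input.
states = ["Delaware"] * 3 + ["Indiana"] * 2 + ["West Virginia"] * 5


def freq_table(values):
    """Return (value, frequency, percent, cumulative frequency, cumulative percent) rows."""
    counts = Counter(values).most_common()  # sorted by frequency, descending
    total = sum(n for _, n in counts)
    cum_freqs = list(accumulate(n for _, n in counts))  # running sum of frequencies
    return [
        (value, n, 100 * n / total, cum, 100 * cum / total)
        for (value, n), cum in zip(counts, cum_freqs)
    ]


for row in freq_table(states):
    print(row)
```

The cumulative percent here is derived from the cumulative frequency rather than by summing the percents, which avoids accumulating floating-point error. One thing to watch in the Spark version: with `order by Frequency desc` and no explicit frame, states that have the same Frequency fall into the same window frame and get identical cumulative values, so adding a tiebreaker such as `state` to the ordering may be worth considering.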