SAS Proc Freq with PySpark (Frequency, percent, cumulative frequency, and cumulative percent)

Posted by 纵然是瞬间 on 2021-01-29 15:29:09

Question


I'm looking for a way to reproduce the SAS Proc Freq output in PySpark. I found this code that does exactly what I need, but it is written in Pandas. I want to make sure my version makes the best use of what Spark can offer, since the code will run on massive datasets. In this other post (which was also adapted for this StackOverflow answer), I also found instructions for computing distributed groupwise cumulative sums in PySpark, but I'm not sure how to adapt them to my case.

Here's an input and output example (my original dataset will have a couple of billion rows):

Input dataset:

        state
0       Delaware
1       Delaware
2       Delaware
3       Indiana
4       Indiana
...     ...
1020    West Virginia
1021    West Virginia
1022    West Virginia
1023    West Virginia
1024    West Virginia

1025 rows × 1 columns

Expected output:

    state           Frequency   Percent Cumulative Frequency    Cumulative Percent
0   Vermont         246         24.00   246                     24.00
1   New Hampshire   237         23.12   483                     47.12
2   Missouri        115         11.22   598                     58.34
3   North Carolina  100         9.76    698                     68.10
4   Indiana         92          8.98    790                     77.07
5   Montana         56          5.46    846                     82.54
6   West Virginia   55          5.37    901                     87.90
7   North Dakota    53          5.17    954                     93.07
8   Washington      39          3.80    993                     96.88
9   Utah            29          2.83    1022                    99.71
10  Delaware        3           0.29    1025                    100.00

Answer 1:


You can first group by state to get the frequency and percent, then use sum over a window to get the cumulative frequency and percent.
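First, a minimal setup sketch (assumed, not part of the original answer) that builds a small test DataFrame matching the sample output shown below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy input: 5 'West Virginia', 3 'Delaware', and 2 'Indiana' rows
df = spark.createDataFrame(
    [('West Virginia',)] * 5 + [('Delaware',)] * 3 + [('Indiana',)] * 2,
    ['state'],
)

With df in place, the solution reads: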

from pyspark.sql import functions as F

result = df.groupBy('state').agg(
    F.count('state').alias('Frequency')
).selectExpr(
    '*',
    # an empty over() spans the entire grouped result, giving the grand total
    '100 * Frequency / sum(Frequency) over() Percent'
).selectExpr(
    '*',
    # ordering the window by Frequency desc yields running totals, as in Proc Freq
    'sum(Frequency) over(order by Frequency desc) Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) Cumulative_Percent'
)

result.show()
+-------------+---------+-------+--------------------+------------------+
|        state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia|        5|   50.0|                   5|              50.0|
|     Delaware|        3|   30.0|                   8|              80.0|
|      Indiana|        2|   20.0|                  10|             100.0|
+-------------+---------+-------+--------------------+------------------+
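Note that these window functions run over the grouped result, which has only one row per state, so the unpartitioned over() clauses are not a scalability concern even with billions of input rows. To match the SAS Proc Freq display exactly, here is a minimal follow-up sketch (column names as produced above, rounding assumed) that truncates the percent columns to two decimals:

formatted = result.select(
    'state',
    'Frequency',
    F.round('Percent', 2).alias('Percent'),
    'Cumulative_Frequency',
    F.round('Cumulative_Percent', 2).alias('Cumulative_Percent'),
)
formatted.show()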


Source: https://stackoverflow.com/questions/65787753/sas-proc-freq-with-pyspark-frequency-percent-cumulative-frequency-and-cumula
