I have the following code in PySpark, resulting in a table showing me the different values for a column and their counts. I want to have another column showing what percentage of the total count each value represents.
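(The snippet itself is not reproduced here; a minimal sketch of the kind of code meant, where df and columnName are placeholder names:)
import pyspark.sql.functions as F

# Show each distinct value of the column together with its count
df.groupBy('columnName').count().show()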
Here is an alternative example, in case you are not comfortable with windowing (which, as the comments suggest, is the better way to go):
# Running in Databricks, where spark, sc, and sqlContext are predefined;
# not everything here is required.
from pyspark.sql import Row
import pyspark.sql.functions as F

# Build a small sample DataFrame from an RDD of Rows
data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
df = sqlContext.createDataFrame(someschema)
# Total row count, used as the denominator for the percentage
tot = df.count()

df.groupBy("c1") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100) \
  .show()
returns:
+---+-------------+-------------------+
| c1|cnt_per_group|perc_of_count_total|
+---+-------------+-------------------+
| E| 1| 16.666666666666664|
| B| 1| 16.666666666666664|
| D| 1| 16.666666666666664|
| C| 1| 16.666666666666664|
| A| 2| 33.33333333333333|
+---+-------------+-------------------+
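If the trailing decimals are noisy, you can round the percentage; a small variation of the same chain using F.round:
df.groupBy("c1") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total',
              F.round(F.col('cnt_per_group') / tot * 100, 2)) \
  .show()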
I focus on Scala, and this seems easier to do there. That said, the solution suggested in the comments uses Window, which is what I would do in Scala with over().
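For completeness, a sketch of that windowed approach in PySpark, assuming the same df as above (an empty partitionBy() makes the window span the whole DataFrame):
from pyspark.sql import Window

w = Window.partitionBy()  # no partition columns: the window covers all rows
df.groupBy('c1') \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total',
              F.col('cnt_per_group') * 100 / F.sum('cnt_per_group').over(w)) \
  .show()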
You can groupBy and aggregate with agg. For example, for the following DataFrame:
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
You can use:
import pyspark.sql.functions as F

df.groupby('category').agg(
    F.count('value').alias('count'),
    (F.count('value') / df.count()).alias('percentage')
).show()
Output:
+--------+-----+------------------+
|category|count| percentage|
+--------+-----+------------------+
| b| 1|0.3333333333333333|
| a| 2|0.6666666666666666|
+--------+-----+------------------+
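Note that df.count() inside the agg expression is an eager action: it runs a separate job to count the rows, and the resulting Python integer is baked into the expression as a literal before the aggregation executes.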
Alternatively, you can use SQL:
df.createOrReplaceTempView('df')
spark.sql(
    """
    SELECT category,
           COUNT(*) AS count,
           COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
    FROM df
    GROUP BY category
    """
).show()
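The same ratio can also be computed with a SQL window function instead of the scalar subquery; a sketch against the same temp view:
spark.sql(
    """
    SELECT category,
           COUNT(*) AS count,
           COUNT(*) / SUM(COUNT(*)) OVER () AS ratio
    FROM df
    GROUP BY category
    """
).show()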