Calculating percentage of total count for groupBy using pyspark

Backend · Unresolved · 2 answers · 1736 views
心在旅途  2021-01-02 15:42

I have the following code in PySpark, resulting in a table showing me the different values for a column and their counts. I want to have another column showing what percentage of the total count each value represents.
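
A minimal sketch of the kind of code described, assuming a DataFrame df with a grouping column c1 (both names are hypothetical):

    # counts of the distinct values in the grouping column
    df.groupBy('c1').count().show()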

2 Answers
  • 2021-01-02 16:14

    As an alternative, here is an example if you are not comfortable with windowing, which the comments allude to and which is the better way to go:

    # Running in Databricks; not all of this is required there
    # (sc and sqlContext are pre-defined in the notebook environment)
    from pyspark.sql import Row
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import *

    data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
            ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]
    rdd = sc.parallelize(data)

    someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))

    df = sqlContext.createDataFrame(someschema)

    # total number of rows, used as the denominator below
    tot = df.count()

    # count per group, then express each group's count as a percentage of the total
    df.groupBy("c1") \
      .count() \
      .withColumnRenamed('count', 'cnt_per_group') \
      .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100) \
      .show()
    

    returns:

    +---+-------------+-------------------+
    | c1|cnt_per_group|perc_of_count_total|
    +---+-------------+-------------------+
    |  E|            1| 16.666666666666664|
    |  B|            1| 16.666666666666664|
    |  D|            1| 16.666666666666664|
    |  C|            1| 16.666666666666664|
    |  A|            2|  33.33333333333333|
    +---+-------------+-------------------+
    

    I focus on Scala, where this seems easier. That said, the solution suggested in the comments uses a Window function, which is what I would do in Scala with over(); a sketch of that approach follows.
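
    For completeness, a minimal sketch of that window-based approach in PySpark, assuming the same df as above (an unpartitioned window supplies the grand total without a separate count() action):

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # an empty partitionBy() puts every row of the grouped result into one window
    w = Window.partitionBy()

    df.groupBy('c1') \
      .count() \
      .withColumnRenamed('count', 'cnt_per_group') \
      .withColumn('perc_of_count_total',
                  F.col('cnt_per_group') * 100 / F.sum('cnt_per_group').over(w)) \
      .show()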

  • 2021-01-02 16:31

    You can groupby and aggregate with agg. For example, for the following DataFrame:

    +--------+-----+
    |category|value|
    +--------+-----+
    |       a|    1|
    |       b|    2|
    |       a|    3|
    +--------+-----+
    

    You can use:

    import pyspark.sql.functions as F
    
    df.groupby('category').agg(
        F.count('value').alias('count'),
        # df.count() is the total number of rows in the whole DataFrame
        (F.count('value') / df.count()).alias('percentage')
    ).show()
    

    Output:

    +--------+-----+------------------+
    |category|count|        percentage|
    +--------+-----+------------------+
    |       b|    1|0.3333333333333333|
    |       a|    2|0.6666666666666666|
    +--------+-----+------------------+
    

    Alternatively, you can use SQL:

    df.createOrReplaceTempView('df')
    
    spark.sql(
        """
        SELECT category,
               COUNT(*) AS count,
               COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
        FROM df
        GROUP BY category
        """
    ).show()
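
    If a rounded percentage (0 to 100) is preferred over a raw ratio, one possible variant of the agg version above (the scale of 2 decimal places is arbitrary):

    import pyspark.sql.functions as F

    df.groupby('category').agg(
        F.count('value').alias('count'),
        # multiply by 100 and round to two decimals; df.count() is the total row count
        F.round(F.count('value') * 100 / df.count(), 2).alias('percentage')
    ).show()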
    