How to calculate sum and count in a single groupBy?

醉梦人生 2020-12-28 16:03

Based on the following DataFrame:

val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
+---+-----+----+
| ID|         


        
3 Answers
  • 2020-12-28 16:21

    I'm using a different example from yours.

    Several aggregate functions can be passed to a single agg call, like this. Adapt it to your data:

    // In 1.3.x, in order for the grouping column "department" to show up,
    // it must be included explicitly as part of the agg function call.
    df.groupBy("department").agg($"department", max("age"), sum("expense"))
    
    // In 1.4+, grouping column "department" is included automatically.
    df.groupBy("department").agg(max("age"), sum("expense"))
    

    Applied to the DataFrame in the question:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark: SparkSession = SparkSession
      .builder.master("local")
      .appName("MyGroup")
      .getOrCreate()
    import spark.implicits._

    val client: DataFrame = spark.sparkContext.parallelize(
      Seq((1,"A",10),(2,"A",5),(3,"B",56))
    ).toDF("ID","Categ","Amnt")

    // Sum and count in a single groupBy/agg call.
    client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
    

    +-----+---------+---------+
    |Categ|sum(Amnt)|count(ID)|
    +-----+---------+---------+
    |    B|       56|        1|
    |    A|       15|        2|
    +-----+---------+---------+
    
  • 2020-12-28 16:32

    There are several ways to apply aggregate functions in Spark:

    val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
    

    1.

    val aggdf = client.groupBy('Categ).agg(Map("ID"->"count","Amnt"->"sum"))
    
    +-----+---------+---------+
    |Categ|count(ID)|sum(Amnt)|
    +-----+---------+---------+
    |B    |1        |56       |
    |A    |2        |15       |
    +-----+---------+---------+
    
    //Rename and sort as needed.
    aggdf.sort('Categ).withColumnRenamed("count(ID)","Count").withColumnRenamed("sum(Amnt)","sum")
    +-----+-----+---+
    |Categ|Count|sum|
    +-----+-----+---+
    |A    |2    |15 |
    |B    |1    |56 |
    +-----+-----+---+
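
    One limitation of the Map form, as far as I know: a Map holds one value per key, so it cannot request two aggregates on the same column (say, both sum and count of Amnt). Below is a minimal sketch of the pair-based overload of agg, which does allow it. The local SparkSession and the names AggPairs and pairs are assumptions for illustration; in spark-shell, spark already exists.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Local session for the sketch; getOrCreate reuses an existing one.
    val spark = SparkSession.builder.master("local[1]").appName("AggPairs").getOrCreate()
    import spark.implicits._

    val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

    // Map("Amnt" -> "sum", "Amnt" -> "count") would keep only the last entry,
    // because Map keys are unique. Passing the pairs directly keeps both.
    val pairs = client.groupBy("Categ").agg("Amnt" -> "sum", "Amnt" -> "count")
    pairs.show()
    ```

    The resulting columns still carry generated names such as sum(Amnt), so the rename step above applies here too.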
    

    2.

    import org.apache.spark.sql.functions._
    client.groupBy('Categ).agg(count("ID").as("count"),sum("Amnt").as("sum"))
    +-----+-----+---+
    |Categ|count|sum|
    +-----+-----+---+
    |B    |1    |56 |
    |A    |2    |15 |
    +-----+-----+---+
    

    3.

    // Requires Guava on the classpath; this uses the java.util.Map overload of agg.
    import com.google.common.collect.ImmutableMap
    client.groupBy('Categ).agg(ImmutableMap.of("ID", "count", "Amnt", "sum"))
    +-----+---------+---------+
    |Categ|count(ID)|sum(Amnt)|
    +-----+---------+---------+
    |B    |1        |56       |
    |A    |2        |15       |
    +-----+---------+---------+
    // Rename the columns if required.
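
    If Guava is not available, the same call should work with any java.util.Map. This is a sketch under that assumption, using a plain HashMap; the names AggJavaMap, exprs and viaJavaMap are made up for the example.

    ```scala
    import java.util.{HashMap => JHashMap}
    import org.apache.spark.sql.SparkSession

    // Local session for the sketch; getOrCreate reuses an existing one.
    val spark = SparkSession.builder.master("local[1]").appName("AggJavaMap").getOrCreate()
    import spark.implicits._

    val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

    // Column name -> aggregate function name, as in the ImmutableMap variant.
    val exprs = new JHashMap[String, String]()
    exprs.put("ID", "count")
    exprs.put("Amnt", "sum")

    val viaJavaMap = client.groupBy("Categ").agg(exprs)
    viaJavaMap.show()
    ```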
    

    4. If you are comfortable with SQL, you can also do this:

    client.createOrReplaceTempView("df")
    
    val aggdf = spark.sql("select Categ, count(ID), sum(Amnt) from df group by Categ")
    aggdf.show()

    +-----+---------+---------+
    |Categ|count(ID)|sum(Amnt)|
    +-----+---------+---------+
    |    B|        1|       56|
    |    A|        2|       15|
    +-----+---------+---------+
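
    The SQL route can also name and order the result directly, which avoids a separate rename. This is a small sketch of that idea; the aliases cnt and total and the names AggSql and named are assumptions chosen for the example.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Local session for the sketch; getOrCreate reuses an existing one.
    val spark = SparkSession.builder.master("local[1]").appName("AggSql").getOrCreate()
    import spark.implicits._

    val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")
    client.createOrReplaceTempView("df")

    // AS gives each aggregate a readable column name; ORDER BY fixes the row order.
    val named = spark.sql(
      """SELECT Categ, count(ID) AS cnt, sum(Amnt) AS total
        |FROM df
        |GROUP BY Categ
        |ORDER BY Categ""".stripMargin)
    named.show()
    ```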
    
  • 2020-12-28 16:38

    You can aggregate the given table like this:

    client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
    
    +-----+---------+---------+
    |Categ|sum(Amnt)|count(ID)|
    +-----+---------+---------+
    |    A|       15|        2|
    |    B|       56|        1|
    +-----+---------+---------+
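
    One detail about count that is easy to miss: count on a column skips nulls in that column, so it is not always the number of rows in the group. The sketch below uses a hypothetical variant of the sample data in which one Amnt is missing; the names withNull, stats, non_null_amnt and rows are made up for the example.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, lit, sum}

    // Local session for the sketch; getOrCreate reuses an existing one.
    val spark = SparkSession.builder.master("local[1]").appName("AggNulls").getOrCreate()
    import spark.implicits._

    // Hypothetical data: the second row has no Amnt (None becomes null).
    val withNull = Seq[(Int, String, Option[Int])](
      (1, "A", Some(10)), (2, "A", None), (3, "B", Some(56))
    ).toDF("ID", "Categ", "Amnt")

    val stats = withNull.groupBy("Categ").agg(
      sum("Amnt").as("sum"),              // nulls are ignored by sum
      count("Amnt").as("non_null_amnt"),  // counts only non-null Amnt values
      count(lit(1)).as("rows")            // counts every row in the group
    ).orderBy("Categ")

    stats.show()
    // Categ A: sum = 10, non_null_amnt = 1, rows = 2
    // Categ B: sum = 56, non_null_amnt = 1, rows = 1
    ```

    With the original data, where ID is never null, count("ID") and a row count give the same result.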
    