Spark - Sum of row values

Asked by 囚心锁ツ on 2020-12-09 13:57

I have the following DataFrame:

January | February | March
-----------------------------
  10    |    10    |  10
  20    |    20    |  20
  50    |    50            


        
I'd like to add a TOTAL column that holds, for each row, the sum of the month columns.

5 Answers
  • 2020-12-09 14:18

    This code is in Python, but it can be easily translated:

    # First we create an RDD in order to build a DataFrame:
    rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
    df = rdd.toDF(['January', 'February', 'March'])
    df.show()
    
    # Here we create a new column called 'TOTAL', which holds the sum
    # of the columns df.January, df.February and df.March:
    
    df.withColumn('TOTAL', df.January + df.February + df.March).show()
    

    Output:

    +-------+--------+-----+
    |January|February|March|
    +-------+--------+-----+
    |     10|      10|   10|
    |     20|      20|   20|
    +-------+--------+-----+
    
    +-------+--------+-----+-----+
    |January|February|March|TOTAL|
    +-------+--------+-----+-----+
    |     10|      10|   10|   30|
    |     20|      20|   20|   60|
    +-------+--------+-----+-----+
    

    You can also create a User Defined Function if you want; see the Spark documentation: UserDefinedFunction (udf)
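
    Note, not part of the original answer: in the asker's third row March is
    missing, and column arithmetic in Spark propagates nulls, so
    January + February + March would come out null for that row. A minimal
    sketch that treats missing values as zero with coalesce():

    from functools import reduce
    from pyspark.sql.functions import coalesce, lit

    # Replace each null with 0 before adding, so rows with missing
    # months still get a numeric total.
    months = ['January', 'February', 'March']
    total = reduce(lambda a, b: a + b,
                   [coalesce(df[c], lit(0)) for c in months])
    df.withColumn('TOTAL', total).show()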

  • 2020-12-09 14:22

    You can use expr() for this. In Scala:

    df.withColumn("TOTAL", expr("January+February+March"))
    
  • 2020-12-09 14:27

    Alternatively, building on Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.

    from functools import reduce
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    # Sum an arbitrary number of columns.
    def superSum(*cols):
        return reduce(lambda a, b: a + b, cols)

    # Give the UDF a numeric return type; udf() defaults to StringType.
    add = udf(superSum, LongType())

    df.withColumn('total', add(*[df[x] for x in df.columns])).show()
    
    
    +-------+--------+-----+-----+
    |January|February|March|total|
    +-------+--------+-----+-----+
    |     10|      10|   10|   30|
    |     20|      20|   20|   60|
    +-------+--------+-----+-----+
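
    As a side note (mine, not the answerer's): the same dynamic sum can be
    written with native Column arithmetic, which avoids the Python UDF
    overhead and lets Catalyst optimize the whole expression:

    from pyspark.sql.functions import col

    # Python's built-in sum() works because Column supports +,
    # yielding a single expression: 0 + col1 + col2 + ...
    df.withColumn('total', sum(col(c) for c in df.columns)).show()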
    
  • 2020-12-09 14:35

    Working Scala example with dynamic column selection:

    // col() (used below) comes from org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._
    val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
    val df = rdd.toDF("January", "February", "March")
    df.show()
    
    +-------+--------+-----+
    |January|February|March|
    +-------+--------+-----+
    |     10|      10|   10|
    |     20|      20|   20|
    +-------+--------+-----+
    
    val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
    sumDF.show()
    
    +-------+--------+-----+-----+
    |January|February|March|TOTAL|
    +-------+--------+-----+-----+
    |     10|      10|   10|   30|
    |     20|      20|   20|   60|
    +-------+--------+-----+-----+
    
  • 2020-12-09 14:36

    You were very close with this:

    val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
    

    Instead, try this:

    val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
    

    I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query and as convenient as the one that uses the UDF. It's the best of both worlds, and it doesn't even add a full line of code!
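
    One caveat worth adding (mine, not the answerer's): select() returns a
    DataFrame containing only the "sum" column. To keep the original columns
    alongside the total, use withColumn; a PySpark sketch of the same
    pattern, with cols_to_sum standing in for the question's colsToSum:

    from functools import reduce
    from pyspark.sql.functions import col

    # Keep January/February/March and append the row total.
    cols_to_sum = df.columns
    df.withColumn('sum', reduce(lambda c1, c2: c1 + c2,
                                [col(c) for c in cols_to_sum])).show()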
