Efficiently calculate row totals of a wide Spark DF

感动是毒 2021-01-22 08:03

I have a wide Spark data frame with a few thousand columns and about a million rows, and I would like to calculate its row totals. My solution so far is below. I used: dplyr

1 answer
  • 2021-01-22 08:11

    You're out of luck here. One way or another you are going to hit some recursion limits (even if you bypass the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:

    • Use spark_apply (at the cost of conversion to and from R):

      wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
      
    • Convert to long format and aggregate (at the cost of explode and shuffle):

      key_expr <- "monotonically_increasing_id() AS key"
      
      value_expr <- paste(
       "explode(array(", paste(colnames(wide_sdf), collapse=","), ")) AS value"
      )
      
      wide_sdf %>% 
        spark_dataframe() %>% 
        # Add id and explode. We need a separate invoke so id is applied
        # before "lateral view"
        sparklyr::invoke("selectExpr", list(key_expr, "*")) %>% 
        sparklyr::invoke("selectExpr", list("key", value_expr)) %>% 
        sdf_register() %>% 
        # Aggregate by id
        group_by(key) %>% 
        summarize(total = sum(value)) %>% 
        arrange(key)
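
The recursion limits mentioned at the top of this answer come from how a long sum is represented: `c1 + c2 + ... + cn` associates to the left, so the expression tree is about as deep as the number of columns, and recursive tree traversals have to descend through every level. The following is a toy model of that nesting in Scala. `Expr`, `Col`, `Add`, and `SumDepth` are hypothetical names used for illustration only; they are not Spark's Catalyst classes.

```scala
// Toy expression tree; illustrative only, not Spark's internal representation.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

object SumDepth {
  // Build ((c1 + c2) + c3) + ... the way a chained `+` associates.
  def leftNestedSum(names: Seq[String]): Expr =
    names.map(n => Col(n): Expr).reduceLeft((acc, c) => Add(acc, c))

  // Measure tree depth with an explicit stack, so the measurement
  // itself does not depend on the JVM call-stack size.
  def depth(root: Expr): Int = {
    var maxDepth = 0
    var stack: List[(Expr, Int)] = List((root, 1))
    while (stack.nonEmpty) {
      val (node, d) = stack.head
      stack = stack.tail
      if (d > maxDepth) maxDepth = d
      node match {
        case Add(l, r) => stack = (l, d + 1) :: (r, d + 1) :: stack
        case Col(_)    => ()
      }
    }
    maxDepth
  }

  def main(args: Array[String]): Unit = {
    val names = (1 to 3000).map(i => s"c$i")
    // Depth grows linearly with the number of summed columns.
    println(depth(leftNestedSum(names)))
  }
}
```

In this model a sum over n columns has depth n, which is why a few thousand columns is enough to exhaust a recursive traversal, whereas the two workarounds above never build such an expression.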
      

    To get something more efficient, you should consider writing a Scala extension and applying the sum directly on a Row object, without exploding:

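    package com.example.sparklyr.rowsum
    
    import org.apache.spark.sql.{DataFrame, Encoders}
    
    object RowSum {
      // Sums the named columns of each row on the JVM, with no explode or shuffle.
      // getAs[Double] assumes every column listed in cols is a double column.
      def apply(df: DataFrame, cols: Seq[String]) = df.map {
        row => cols.map(c => row.getAs[Double](c)).sum
      }(Encoders.scalaDouble)
    }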
    

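    and then call it from R, passing the column names as the second argument that RowSum.apply expects:

    invoke_static(
      sc, "com.example.sparklyr.rowsum.RowSum", "apply",
      wide_sdf %>% spark_dataframe,
      # Names of the columns to sum (cols in the Scala signature)
      as.list(colnames(wide_sdf))
    ) %>% sdf_register()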
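
Before packaging the extension, it may help to check the Scala side on its own. The following is a minimal local-mode sketch, assuming Spark SQL is on the classpath; `RowSumCheck`, `rowSum`, the app name, and the sample data are made up for illustration, and the summing logic is repeated from `RowSum.apply` only so the block is self-contained.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, Row, SparkSession}

object RowSumCheck {
  // Same logic as RowSum.apply in the answer, repeated here for self-containment.
  // Assumes every column named in `cols` is a double column.
  def rowSum(df: DataFrame, cols: Seq[String]): Dataset[Double] =
    df.map((row: Row) => cols.map(c => row.getAs[Double](c)).sum)(Encoders.scalaDouble)

  def main(args: Array[String]): Unit = {
    // Local single-threaded session, for checking only (hypothetical app name).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rowsum-check")
      .getOrCreate()
    import spark.implicits._

    // Two rows, three double columns (made-up sample data).
    val df = Seq((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)).toDF("a", "b", "c")

    // Sort after collecting, since row order is not guaranteed in general.
    val totals = rowSum(df, df.columns.toSeq).collect().sorted
    println(totals.mkString(", "))

    spark.stop()
  }
}
```

If the data frame mixes integer and double columns, `getAs[Double]` will fail at runtime with a cast error, so casting the columns to double beforehand is the simplest precaution.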
    