How to use “cube” only for specific fields of a Spark DataFrame?


I believe you cannot avoid the problem completely, but there is a simple trick to reduce its scale. The idea is to replace all columns which shouldn't be marginalized with a single placeholder.

For example, if you have a DataFrame:

val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

and you're interested in a cube marginalized by d and e and grouped by a..c, you can define the substitute for a..c as:

import org.apache.spark.sql.functions.struct
import spark.implicits._  // spark: SparkSession (use sqlContext.implicits._ on Spark 1.x)

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")

and cube:

val cubed = Seq($"d", $"e")

// If there is a problem with aliasing rest, it can be done here instead.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

A quick filter and select should handle the rest: the where clause drops the rows in which the placeholder struct itself has been marginalized to null, and rest.* expands the struct back into its original columns:

tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

with a result like:

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+
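
For reference, here is a minimal, self-contained sketch of the same trick, assuming Spark 2.x or later with a SparkSession available as spark; the names df, rest, cubed and tmp simply mirror the snippets above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().appName("partial-cube").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

// Pack the columns that should NOT be marginalized into a single struct.
val rest = struct($"a", $"b", $"c").alias("rest")
val cubed = Seq($"d", $"e")

// Cube only over the placeholder plus the columns of interest.
val tmp = df.cube(rest +: cubed: _*).count

// Drop rows where the placeholder itself was marginalized to null,
// then expand the struct back into its original columns.
tmp.where($"rest".isNotNull)
  .select($"rest.*" +: cubed :+ $"count": _*)
  .show()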