How to use “cube” only for specific fields on Spark dataframe?


Question


I'm using Spark 1.6.1, and I have a dataframe like this:

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
......

I want to get a dataframe like the following by using a cube operation.

(Grouped by all fields, but with only the "os_name", "country", and "app_ver" fields cubed.)

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
......

I have tried the following, but it seems slow and ugly:

var cubed = df
  .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver", $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
  .count
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")

Any better solutions?


Answer 1:


I believe you cannot avoid the problem completely, but there is a simple trick you can use to reduce its scale. The idea is to replace all columns that shouldn't be marginalized with a single placeholder.

For example, if you have a DataFrame:

val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

and you're interested in a cube marginalized by d and e and grouped by a..c, you can define the substitute for a..c as:

import org.apache.spark.sql.functions.struct
import sqlContext.implicits._ // implicits from your SQLContext instance (Spark 1.6)

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")

and cube:

val cubed = Seq($"d", $"e")

// If there is a problem with aliasing rest above, it can be done here instead.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

A quick filter and select should handle the rest:

tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

with a result like:

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+
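
Applied back to the dataframe in the question, the same trick might look like the sketch below (this assumes the column names shown above; the placeholder name "rest", the value name cubedCols, and the renamed count column "cnt" are illustrative, and the resulting column order may need a final select to match the desired layout):

import org.apache.spark.sql.functions.struct
import sqlContext.implicits._ // implicits from your SQLContext instance (Spark 1.6)

// Pack every column that should NOT be marginalized into one struct placeholder.
val rest = struct(Seq(
  $"scene_id", $"action_id", $"classifier",
  $"p0value", $"p1value", $"p2value", $"p3value", $"p4value"
): _*)

// Only these three fields are cubed.
val cubedCols = Seq($"os_name", $"country", $"app_ver")

// Alias the struct in the cube call, as noted above for Spark 1.6.
val tmp = df.cube(rest.alias("rest") +: cubedCols: _*).count

// Keep only rows where the placeholder is intact, then unpack it again.
val result = tmp
  .where($"rest".isNotNull)
  .select($"rest.*" +: cubedCols :+ $"count".alias("cnt"): _*)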


Source: https://stackoverflow.com/questions/40763325/how-to-use-cube-only-for-specific-fields-on-spark-dataframe
