Spark aggregations where output columns are functions and rows are columns


Question


I want to compute a bunch of different agg functions on different columns in a dataframe.

I know I can do something like this, but the output is all one row.

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))
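For reference, that call returns a single wide row with one column per aggregation, roughly shaped like this (exact column names depend on the Spark version):

# +---------+---------+---------+---------+
# |max(cola)|min(cola)|max(colb)|min(colb)|
# +---------+---------+---------+---------+
# |       10|        1|      ...|      ...|
# +---------+---------+---------+---------+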

Let's say I will be performing 100 different aggregations on 10 different columns.

I want the output dataframe to be like this -

      | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...etc...
cola  | 1    | 10   | ...
colb  | 2    | NULL | ...
colc  | 5    | 20   | ...
cold  | NULL | 42   | ...
...

Here each row is a column I am aggregating, and each output column is an aggregation function. Some cells will be null if, for example, I don't calculate the max of colb.

How can I accomplish this?


Answer 1:


You can create a Map column, say Metrics, where the keys are column names and the values are structs of aggregations (max, min, avg, ...). The map column is built with the map_from_entries function (available since Spark 2.4). Then simply explode the map to get the structure you want.

Here is an example you can adapt for your requirement:

from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

df = spark.createDataFrame([("A", 1, 2), ("B", 2, 4), ("C", 5, 6), ("D", 6, 8)], ['cola', 'colb', 'colc'])

agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(max(c).alias("Max"), min(c).alias("Min"))
               )
        for c in df.columns
    ])).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()

#+----+---+---+
#|col |Max|Min|
#+----+---+---+
#|cola|D  |A  |
#|colb|6  |1  |
#|colc|8  |2  |
#+----+---+---+
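To scale this up to more aggregation functions per column, you can keep adding fields to the inner struct; each field becomes one more output column after the final select. A minimal sketch, reusing the example df above and adding avg and count (note that avg over the string column cola will simply come back NULL here, or error if ANSI mode is enabled):

from pyspark.sql.functions import (
    array, avg, count, explode, lit, map_from_entries, max, min, struct
)

# Same pattern as above, with two extra aggregations per column.
agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(max(c).alias("Max"),
                      min(c).alias("Min"),
                      avg(c).alias("Avg"),
                      count(c).alias("Count")))
        for c in df.columns
    ])).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()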



Answer 2:


Here is one solution that allows you to set the aggregations dynamically from a predefined list. The solution uses map_from_arrays among other functions and therefore requires Spark >= 2.4.0:

from pyspark.sql.functions import lit, expr, array, explode, map_from_arrays

df = spark.createDataFrame([
  [1, 2.3, 5000],
  [2, 5.3, 4000],
  [3, 2.1, 3000],
  [4, 1.5, 4500]
], ["cola", "colb", "colc"])

aggs = ["min", "max", "avg", "sum"]
aggs_select_expr = [f"value[{idx}] as {agg}" for idx, agg in enumerate(aggs)]

agg_keys = []
agg_values = []

# generate map here where key is col name and value an array of aggregations
for c in df.columns:
  agg_keys.append(lit(c)) # the key, e.g. cola
  agg_values.append(array(*[expr(f"{a}({c})") for a in aggs])) # the value, e.g. [expr("min(cola)"), expr("max(cola)"), expr("avg(cola)"), expr("sum(cola)")]

df.agg(
  map_from_arrays(
    array(agg_keys), 
    array(agg_values)
  ).alias("aggs")
) \
.select(explode("aggs")) \
.selectExpr("key as col", *aggs_select_expr) \
.show(10, False)

# +----+------+------+------+-------+
# |col |min   |max   |avg   |sum    |
# +----+------+------+------+-------+
# |cola|1.0   |4.0   |2.5   |10.0   |
# |colb|1.5   |5.3   |2.8   |11.2   |
# |colc|3000.0|5000.0|4125.0|16500.0|
# +----+------+------+------+-------+

Description: with the expression array(*[expr(f"{a}({c})") for a in aggs]) we create an array that contains all the aggregations of the current column. Each item of the generated array is produced by expr(f"{a}({c})"), which yields e.g. expr("min(cola)").
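As a quick illustration (plain Python, no Spark session needed), here is what that comprehension generates for one column before expr() parses it:

aggs = ["min", "max", "avg", "sum"]
c = "cola"

# The SQL fragments that expr() will turn into Column expressions:
print([f"{a}({c})" for a in aggs])
# ['min(cola)', 'max(cola)', 'avg(cola)', 'sum(cola)']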

These arrays make up the values in agg_values, which together with agg_keys compose our final map through the expression map_from_arrays(array(agg_keys), array(agg_values)). This is what the structure of the map looks like:

map(
    cola -> [min(cola), max(cola), avg(cola), sum(cola)]
    colb -> [min(colb), max(colb), avg(colb), sum(colb)]
    colc -> [min(colc), max(colc), avg(colc), sum(colc)]
)

In order to extract the information we need, we must explode the previous map with explode("aggs"); this creates two columns, key and value, which we use in our select statement.

aggs_select_expr will contain values of the form ["value[0] as min", "value[1] as max", "value[2] as avg", "value[3] as sum"], which is the input of the selectExpr statement.
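For reference, this is roughly what the intermediate result looks like right after explode("aggs"), using the example data above (the exact display may vary):

# df.agg(...).select(explode("aggs")).show(truncate=False)
#
# +----+---------------------------------+
# |key |value                            |
# +----+---------------------------------+
# |cola|[1.0, 4.0, 2.5, 10.0]            |
# |colb|[1.5, 5.3, 2.8, 11.2]            |
# |colc|[3000.0, 5000.0, 4125.0, 16500.0]|
# +----+---------------------------------+
#
# selectExpr("key as col", *aggs_select_expr) then turns each array position
# into its own named column, producing the final table shown above.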



Source: https://stackoverflow.com/questions/60342610/spark-aggregations-where-output-columns-are-functions-and-rows-are-columns
