Spark aggregations where output columns are functions and rows are columns


Question


I want to compute a bunch of different agg functions on different columns in a dataframe.

I know I can do something like this, but the output is all one row.

df.agg(max("cola"), min("cola"), max("colb"), min("colb"))
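For reference, that call returns a single wide row with one column per aggregation, roughly shaped like this (exact column names depend on the Spark version):

# +---------+---------+---------+---------+
# |max(cola)|min(cola)|max(colb)|min(colb)|
# +---------+---------+---------+---------+
# |       10|        1|      ...|      ...|
# +---------+---------+---------+---------+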

Let's say I will be performing 100 different aggregations on 10 different columns.

I want the output dataframe to be like this -

      | Min  | Max  | AnotherAggFunction1 | AnotherAggFunction2 | ...etc...
cola  | 1    | 10   | ...
colb  | 2    | NULL | ...
colc  | 5    | 20   | ...
cold  | NULL | 42   | ...
...

Here each row is a column I am aggregating, and each output column is an aggregation function. Some cells will be null if, for example, I don't calculate the max of colb.

How can I accomplish this?


Answer 1:


You can create a Map column, say Metrics, where the keys are column names and the values are structs of aggregations (max, min, avg, ...). The map column is built with the map_from_entries function (available since Spark 2.4). Then simply explode the map to get the structure you want.

Here is an example you can adapt for your requirement:

from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

df = spark.createDataFrame([("A", 1, 2), ("B", 2, 4), ("C", 5, 6), ("D", 6, 8)], ['cola', 'colb', 'colc'])

agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(max(c).alias("Max"), min(c).alias("Min"))
               )
        for c in df.columns
    ])).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()

#+----+---+---+
#|col |Max|Min|
#+----+---+---+
#|cola|D  |A  |
#|colb|6  |1  |
#|colc|8  |2  |
#+----+---+---+
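To scale this up to more aggregation functions per column, you can keep adding fields to the inner struct; each field becomes one more output column after the final select. A minimal sketch, reusing the example df above and adding avg and count (note that avg over the string column cola will simply come back NULL here, or error if ANSI mode is enabled):

from pyspark.sql.functions import (
    array, avg, count, explode, lit, map_from_entries, max, min, struct
)

# Same pattern as above, with two extra aggregations per column.
agg = map_from_entries(array(
    *[
        struct(lit(c),
               struct(max(c).alias("Max"),
                      min(c).alias("Min"),
                      avg(c).alias("Avg"),
                      count(c).alias("Count")))
        for c in df.columns
    ])).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show()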



Answer 2:


Here is one solution that allows you to set the aggregations dynamically from a predefined list. The solution uses map_from_arrays among other functions and therefore requires Spark >= 2.4.0:

from pyspark.sql.functions import lit, expr, array, explode, map_from_arrays

df = spark.createDataFrame([
  [1, 2.3, 5000],
  [2, 5.3, 4000],
  [3, 2.1, 3000],
  [4, 1.5, 4500]
], ["cola", "colb", "colc"])

aggs = ["min", "max", "avg", "sum"]
aggs_select_expr = [f"value[{idx}] as {agg}" for idx, agg in enumerate(aggs)]

agg_keys = []
agg_values = []

# generate map here where key is col name and value an array of aggregations
for c in df.columns:
  agg_keys.append(lit(c)) # the key, e.g. cola
  agg_values.append(array(*[expr(f"{a}({c})") for a in aggs])) # the value, e.g. [expr("min(cola)"), expr("max(cola)"), expr("avg(cola)"), expr("sum(cola)")]

df.agg(
  map_from_arrays(
    array(agg_keys), 
    array(agg_values)
  ).alias("aggs")
) \
.select(explode("aggs")) \
.selectExpr("key as col", *aggs_select_expr) \
.show(10, False)

# +----+------+------+------+-------+
# |col |min   |max   |avg   |sum    |
# +----+------+------+------+-------+
# |cola|1.0   |4.0   |2.5   |10.0   |
# |colb|1.5   |5.3   |2.8   |11.2   |
# |colc|3000.0|5000.0|4125.0|16500.0|
# +----+------+------+------+-------+

Description: with the expression array(*[expr(f"{a}({c})") for a in aggs]) we create an array that contains all the aggregations of the current column. Each item of the generated array is produced by expr(f"{a}({c})"), which yields e.g. expr("min(cola)").
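As a quick illustration (plain Python, no Spark session needed), here is what that comprehension generates for one column before expr() parses it:

aggs = ["min", "max", "avg", "sum"]
c = "cola"

# The SQL fragments that expr() will turn into Column expressions:
print([f"{a}({c})" for a in aggs])
# ['min(cola)', 'max(cola)', 'avg(cola)', 'sum(cola)']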

These arrays make up the values in agg_values, which together with agg_keys compose our final map through the expression map_from_arrays(array(agg_keys), array(agg_values)). This is what the structure of the map looks like:

map(
    cola -> [min(cola), max(cola), avg(cola), sum(cola)]
    colb -> [min(colb), max(colb), avg(colb), sum(colb)]
    colc -> [min(colc), max(colc), avg(colc), sum(colc)]
)

In order to extract the information we need, we must explode the previous map with explode("aggs"); this creates two columns, key and value, which we use in our select statement.

aggs_select_expr will contain values of the form ["value[0] as min", "value[1] as max", "value[2] as avg", "value[3] as sum"], which is the input of the selectExpr statement.
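For reference, this is roughly what the intermediate result looks like right after explode("aggs"), using the example data above (the exact display may vary):

# df.agg(...).select(explode("aggs")).show(truncate=False)
#
# +----+---------------------------------+
# |key |value                            |
# +----+---------------------------------+
# |cola|[1.0, 4.0, 2.5, 10.0]            |
# |colb|[1.5, 5.3, 2.8, 11.2]            |
# |colc|[3000.0, 5000.0, 4125.0, 16500.0]|
# +----+---------------------------------+
#
# selectExpr("key as col", *aggs_select_expr) then turns each array position
# into its own named column, producing the final table shown above.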



Source: https://stackoverflow.com/questions/60342610/spark-aggregations-where-output-columns-are-functions-and-rows-are-columns
