Question
I want to compute a bunch of different agg functions on different columns in a dataframe.
I know I can do something like this, but the output is all one row.
df.agg(max("cola"), min("cola"), max("colb"), min("colb"))
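For illustration, that returns everything in a single row with one column per aggregation, something like (values hypothetical):
# +---------+---------+---------+---------+
# |max(cola)|min(cola)|max(colb)|min(colb)|
# +---------+---------+---------+---------+
# |       10|        1|       15|        2|
# +---------+---------+---------+---------+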
Let's say I will be performing 100 different aggregations on 10 different columns.
I want the output dataframe to be like this -
|    |Min |Max |AnotherAggFunction1|AnotherAggFunction2|...
|cola|1   |10  |...
|colb|2   |NULL|...
|colc|5   |20  |...
|cold|NULL|42  |...
...
Where my rows are the columns I am performing aggregations on, and my columns are the aggregation functions. Some cells will be null if I don't calculate that aggregation, e.g. colb's max.
How can I accomplish this?
Answer 1:
You can create a Map column, say Metrics, where the keys are column names and the values are structs of aggregations (max, min, avg, ...). I am using the map_from_entries function to create the map column (available from Spark 2.4+). Then simply explode the map to get the structure you want.
Here is an example you can adapt for your requirement:
from pyspark.sql.functions import array, explode, lit, map_from_entries, max, min, struct

df = spark.createDataFrame([("A", 1, 2), ("B", 2, 4), ("C", 5, 6), ("D", 6, 8)],
                           ['cola', 'colb', 'colc'])

# one map entry per column: key = column name, value = struct of its aggregations
agg = map_from_entries(array(
    *[struct(lit(c),
             struct(max(c).alias("Max"), min(c).alias("Min")))
      for c in df.columns]
)).alias("Metrics")

df.agg(agg).select(explode("Metrics").alias("col", "Metrics")) \
    .select("col", "Metrics.*") \
    .show(truncate=False)
#+----+---+---+
#|col |Max|Min|
#+----+---+---+
#|cola|D  |A  |
#|colb|6  |1  |
#|colc|8  |2  |
#+----+---+---+
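To compute more aggregations per column, you would extend the inner struct with more fields. A sketch, assuming avg is also wanted (avg needs to be imported too, and over the string column cola it yields null):
struct(max(c).alias("Max"), min(c).alias("Min"), avg(c).alias("Avg"))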
Answer 2:
Here is one solution which allows you to set aggregations dynamically from a predefined list. The solution uses map_from_arrays among others, and is therefore compatible with Spark >= 2.4.0:
from pyspark.sql.functions import lit, expr, array, explode, map_from_arrays
df = spark.createDataFrame([
    [1, 2.3, 5000],
    [2, 5.3, 4000],
    [3, 2.1, 3000],
    [4, 1.5, 4500]
], ["cola", "colb", "colc"])
aggs = ["min", "max", "avg", "sum"]
aggs_select_expr = [f"value[{idx}] as {agg}" for idx, agg in enumerate(aggs)]

agg_keys = []
agg_values = []

# generate the map here, where the key is the column name and the value an array of aggregations
for c in df.columns:
    agg_keys.append(lit(c))  # the key, i.e. cola
    agg_values.append(array(*[expr(f"{a}({c})") for a in aggs]))  # the value, i.e. [expr("min(cola)"), expr("max(cola)"), expr("avg(cola)"), expr("sum(cola)")]
df.agg(
    map_from_arrays(
        array(agg_keys),
        array(agg_values)
    ).alias("aggs")
).select(explode("aggs")) \
 .selectExpr("key as col", *aggs_select_expr) \
 .show(10, False)
# +----+------+------+------+-------+
# |col |min   |max   |avg   |sum    |
# +----+------+------+------+-------+
# |cola|1.0   |4.0   |2.5   |10.0   |
# |colb|1.5   |5.3   |2.8   |11.2   |
# |colc|3000.0|5000.0|4125.0|16500.0|
# +----+------+------+------+-------+
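Since each aggregation is just a SQL function name parsed by expr, the predefined list can be extended with any built-in Spark SQL aggregate, and aggs_select_expr adapts automatically, e.g.:
aggs = ["min", "max", "avg", "sum", "stddev", "count"]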
Description: with the expression array(*[expr(f"{a}({c})") for a in aggs]) we create an array that contains all the aggregations of the current column. Each item of the generated array is evaluated with expr(f"{a}({c})"), which produces e.g. expr("min(cola)"). These arrays make up the values of agg_values, which together with agg_keys compose our final map through the expression map_from_arrays(array(agg_keys), array(agg_values)). This is what the structure of the map looks like:
map(
  cola -> [min(cola), max(cola), avg(cola), sum(cola)],
  colb -> [min(colb), max(colb), avg(colb), sum(colb)],
  colc -> [min(colc), max(colc), avg(colc), sum(colc)]
)
In order to extract the information we need, we must explode the previous map with explode("aggs"). This creates the two columns key and value, which we use in our select statements.
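For reference, the intermediate result right after the explode looks like this (values derived from the final output above):
# df.agg(...).select(explode("aggs")).show(10, False)
# +----+---------------------------------+
# |key |value                            |
# +----+---------------------------------+
# |cola|[1.0, 4.0, 2.5, 10.0]            |
# |colb|[1.5, 5.3, 2.8, 11.2]            |
# |colc|[3000.0, 5000.0, 4125.0, 16500.0]|
# +----+---------------------------------+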
aggs_select_expr will contain values of the form ["value[0] as min", "value[1] as max", "value[2] as avg", "value[3] as sum"], which become the input of the selectExpr statement.
Source: https://stackoverflow.com/questions/60342610/spark-aggregations-where-output-columns-are-functions-and-rows-are-columns