I would like to do multiple aggregations in Spark Structured Streaming.
TLDR - this is not supported; in some cases workarounds are possible.
Longer version -
In some cases workarounds are possible. For example, if you'd like to have multiple count(distinct) aggregations in a streaming query on low-cardinality columns, you can get approx_count_distinct to return the exact number of distinct elements by setting its rsd argument low enough (that's the second, optional argument of approx_count_distinct; it defaults to 0.05).
How is "low-cardinality" defined here? I wouldn't recommend relying on this approach for columns that can have more than 1000 unique values.
So in your streaming query you can do something like this:

```python
from pyspark.sql.functions import approx_count_distinct

(spark.readStream....  # your streaming source
    .groupBy("site_id")
    .agg(
        approx_count_distinct("domain", 0.001).alias("distinct_domains"),
        approx_count_distinct("country", 0.001).alias("distinct_countries"),
        approx_count_distinct("language", 0.001).alias("distinct_languages"),
    )
)
```
Here's proof it actually works: count(distinct) and approx_count_distinct give the same results!
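You can reproduce this check in batch mode; here's a minimal sketch (assuming an existing SparkSession named spark; the column name and cardinality are made up for illustration):

```python
from pyspark.sql.functions import approx_count_distinct, countDistinct, col

# Build a test frame with exactly 100 distinct "domain" values.
df = spark.range(1_000_000).withColumn(
    "domain", (col("id") % 100).cast("string")
)

df.agg(
    countDistinct("domain").alias("exact"),
    approx_count_distinct("domain", 0.001).alias("approx"),
).show()
# With rsd this low, both columns should print 100.
```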
Here's some guidance on the rsd argument of approx_count_distinct: depending on the column's cardinality, an rsd of 0.02 was necessary in one case, and an rsd of 0.001 in another, before the approximate count matched the exact one.

PS. Also notice that I had to comment out the experiment on a column with 10k distinct values, as I didn't have enough patience for it to complete. That's why I mentioned you should not use this hack for columns with over 1k distinct values. For approx_count_distinct to match an exact count(distinct) on over 1k distinct values, rsd would have to be way too low for what the HyperLogLog++ algorithm (the algorithm behind the approx_count_distinct implementation) was designed for.
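If you want to find the rsd needed for your own columns, a sweep along these lines works (a minimal sketch; the row count, cardinalities, and rsd grid are illustrative, not the exact numbers from my runs):

```python
from pyspark.sql.functions import approx_count_distinct, countDistinct, col

# Sweep cardinality x rsd and report where approx matches exact.
# A 10k-distinct column is deliberately left out: at the rsd it
# would need, the aggregation gets painfully slow.
for n_distinct in (100, 1000):
    df = spark.range(1_000_000).withColumn(
        "v", (col("id") % n_distinct).cast("string")
    )
    for rsd in (0.05, 0.02, 0.001):
        row = df.agg(
            countDistinct("v").alias("exact"),
            approx_count_distinct("v", rsd).alias("approx"),
        ).first()
        print(n_distinct, rsd, row["exact"], row["approx"])
```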
As somebody else mentioned, you can use Spark's arbitrary stateful streaming to implement your own aggregates, with as many aggregations as necessary on a single stream, via [flat]mapGroupsWithState. That would be a legitimate, supported way to do it, unlike the above hack that only works in some cases. Note that this method is only available in the Spark Scala API, not in PySpark.
A proper fix would be native support for multiple aggregations in Spark Structured Streaming - https://github.com/apache/spark/pull/23576 - vote on that SPARK JIRA / PR and show your support if you're interested in this.