Multiple aggregations in Spark Structured Streaming

难免孤独 2020-12-09 04:09

I would like to do multiple aggregations in Spark Structured Streaming.

Something like this:

  • Read a stream of input files (from a folder)
  • Per
8 Answers
  •  时光说笑
    2020-12-09 04:20

    TLDR - this is not supported; in some cases workarounds are possible.

    Longer version -

    1. (a hack)

    In some cases workarounds are possible. For example, if you'd like multiple count(distinct) aggregations in a streaming query on low-cardinality columns, you can get approx_count_distinct to return the exact number of distinct elements by setting its rsd argument low enough (that's the second, optional argument of approx_count_distinct; the default is 0.05).

    How is "low-cardinality" defined here? I don't recommend relying on this approach for columns that can have more than 1000 unique values.

    So in your streaming query you can do something like this -

    from pyspark.sql.functions import approx_count_distinct

    (spark.readStream....                      # your streaming source goes here
          .groupBy("site_id")
          .agg(approx_count_distinct("domain", 0.001).alias("distinct_domains"),
               approx_count_distinct("country", 0.001).alias("distinct_countries"),
               approx_count_distinct("language", 0.001).alias("distinct_languages"))
    )
    

    Here's proof it actually works: count(distinct) and approx_count_distinct give the same results when rsd is low enough. Some guidance on the rsd argument of approx_count_distinct:

    • for a column with 100 distinct values rsd of 0.02 was necessary;
    • for a column with 1000 distinct values rsd of 0.001 was necessary.

    PS. Also notice that I had to comment out the experiment on a column with 10k distinct values, as I didn't have enough patience for it to complete. That's why I said you shouldn't use this hack for columns with more than 1k distinct values: for approx_count_distinct to match an exact count(distinct) above 1k distinct values, rsd would have to be far lower than what the HyperLogLog++ algorithm (which underlies the approx_count_distinct implementation) was designed for.
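
    If you want to reproduce that comparison yourself, here is a minimal batch-mode sketch (written in Scala, same language as the stateful sketch in the next point; the column name and cardinality are made up) that pits an exact countDistinct against approx_count_distinct at different rsd values:

    // Minimal sketch: compare exact countDistinct with approx_count_distinct
    // on a synthetic column with 1000 distinct values.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{approx_count_distinct, col, countDistinct}

    object RsdCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rsd-check").master("local[*]").getOrCreate()

        // 100,000 rows, 1,000 distinct values in the "domain" column.
        val df = spark.range(100000).select((col("id") % 1000).cast("string").alias("domain"))

        df.agg(
          countDistinct("domain").alias("exact"),
          approx_count_distinct("domain", 0.001).alias("approx_rsd_0_001"),
          approx_count_distinct("domain").alias("approx_default_rsd")  // default rsd = 0.05
        ).show()
        // With rsd = 0.001 the approximate count should match the exact 1000;
        // with the default rsd of 0.05 it may be off by a few percent.

        spark.stop()
      }
    }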

    2. (a nicer but more involved way)

    As somebody else mentioned, you can use Spark's arbitrary stateful streaming to implement your own aggregates, with as many aggregations as necessary on a single stream, using [flat]MapGroupsWithState. Unlike the hack above, which only works in some cases, this is a legitimate and supported way to do it. This method is only available in the Spark Scala API, not in PySpark.
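
    Here is a minimal sketch of that approach, assuming a hypothetical Event case class and a JSON folder source; it keeps several per-key aggregates (an event count plus exact distinct domains and countries) in a single piece of state:

    // Sketch of arbitrary stateful processing with mapGroupsWithState.
    // Event, SiteStats and the input path are assumptions; groupByKey,
    // mapGroupsWithState and GroupState are the actual Scala APIs.
    import org.apache.spark.sql.{Encoders, SparkSession}
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(siteId: String, domain: String, country: String)
    case class SiteStats(siteId: String, events: Long, domains: Seq[String], countries: Seq[String])

    object MultiAggWithState {
      // Fold one micro-batch of events for a given siteId into the running state.
      def updateStats(siteId: String,
                      events: Iterator[Event],
                      state: GroupState[SiteStats]): SiteStats = {
        val old = state.getOption.getOrElse(SiteStats(siteId, 0L, Seq.empty, Seq.empty))
        val updated = events.foldLeft(old) { (acc, e) =>
          acc.copy(
            events = acc.events + 1,
            domains = (acc.domains :+ e.domain).distinct,      // exact distinct tracking,
            countries = (acc.countries :+ e.country).distinct) // fine for low cardinality
        }
        state.update(updated)
        updated
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("multi-agg-state").getOrCreate()
        import spark.implicits._

        val events = spark.readStream
          .schema(Encoders.product[Event].schema)
          .json("/path/to/input")   // hypothetical folder of JSON files
          .as[Event]

        val stats = events
          .groupByKey(_.siteId)
          .mapGroupsWithState[SiteStats, SiteStats](GroupStateTimeout.NoTimeout)(updateStats _)

        stats.writeStream
          .outputMode(OutputMode.Update())  // mapGroupsWithState requires Update mode
          .format("console")
          .start()
          .awaitTermination()
      }
    }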

    3. (perhaps this will be a long-term solution one day)

    A proper way would be native support for multiple aggregations in Spark Structured Streaming - https://github.com/apache/spark/pull/23576 - vote on that Spark JIRA / PR and show your support if you're interested in this.
