Append only new aggregates based on groupby keys

Question

I have to process files that arrive daily. The data has the primary key (date, client_id, operation_id), so I created a stream that appends only new data to a Delta table:

operations\
        .repartition('date')\
        .writeStream\
        .outputMode('append')\
        .trigger(once=True)\
        .option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
        .format('delta')\
        .partitionBy('date')\
        .start('/mnt/sandbox/operations')

This works fine, but I need to summarize this data grouped by (date, client_id), so I created another stream from the operations table into a new table:

summarized = spark.readStream.format('delta').load('/mnt/sandbox/operations')

summarized = summarized.groupBy('client_id', 'date').agg(<a lot of aggs>)

summarized.repartition('date')\
        .writeStream\
        .outputMode('complete')\
        .trigger(once=True)\
        .option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
        .format('delta')\
        .partitionBy('date')\
        .start('/mnt/sandbox/summarized')

This works, but every time new data arrives in the operations table, Spark recalculates summarized from scratch. I tried append mode on the second stream, but it needs a watermark, and date is a DateType column.

Is there a way to calculate only the new aggregates based on the group keys and append them to summarized?

Answer 1:

You need to use window operations from Spark Structured Streaming.

When you use windowed operations, Spark buckets rows according to windowDuration and slideDuration. windowDuration is the length of each window, and slideDuration is how far the window slides between consecutive windows.
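To make the bucketing concrete, here is a minimal pure-Python sketch of how a single timestamp maps to sliding windows. It assumes windows are half-open [start, end) and aligned to the Unix epoch, which is my understanding of Spark's default behaviour. sliding_windows is a hypothetical helper written for illustration, not a Spark API.

```python
from datetime import datetime, timedelta, timezone


def sliding_windows(ts, window, slide):
    """Return the [start, end) windows that contain ts, oldest first."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Start of the latest window that begins at or before ts.
    start = ts - ((ts - epoch) % slide)
    windows = []
    # Walk backwards one slide at a time while the window still covers ts.
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide
    return list(reversed(windows))


ts = datetime(2019, 9, 25, 12, 7, tzinfo=timezone.utc)
for start, end in sliding_windows(ts, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), end.time())
# prints:
# 12:00:00 12:10:00
# 12:05:00 12:15:00
```

With a 10 minute window and a 5 minute slide, every row lands in two overlapping windows. Setting slideDuration equal to windowDuration gives non-overlapping windows, so each row is counted once.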

If you group by window() [docs], you get a resulting window column alongside the other columns you group by, such as client_id.

For example:

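from pyspark.sql.functions import window

# before_summary is the streaming DataFrame to aggregate (the operations stream in the question).
# window() is documented for timestamp columns; the question's date column is DateType.
# orderBy on a streaming aggregate is only supported in complete output mode.
windowDuration = "10 minutes"
slideDuration = "5 minutes"
summarized = before_summary.groupBy(before_summary.client_id,
    window(before_summary.date, windowDuration, slideDuration)
).agg(<a lot of aggs>).orderBy('window')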

Source: https://stackoverflow.com/questions/58100079/append-only-new-aggregates-based-on-groupby-keys

Tags

apache-spark

pyspark

spark-structured-streaming

azure-databricks

delta-lake