Question
I have to process files that arrive daily. Each record has the primary key (date, client_id, operation_id), so I created a stream that appends only new data to a Delta table:
operations\
.repartition('date')\
.writeStream\
.outputMode('append')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/operations')
This works fine, but I need to summarize this data grouped by (date, client_id), so I created a second stream from the operations table into a new table:
summarized= spark.readStream.format('delta').load('/mnt/sandbox/operations')
summarized= summarized.groupBy('client_id','date').agg(<a lot of aggs>)
summarized.repartition('date')\
.writeStream\
.outputMode('complete')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/summarized')
This works, but every time new data arrives in the operations table, Spark recalculates summarized from scratch. I tried append mode on the second stream, but it requires a watermark, and date is a DateType column.
Is there a way to compute only the new aggregates for the affected group keys and append them to summarized?
Answer 1:
You need to use Spark Structured Streaming window operations.
A windowed aggregation buckets rows by event time according to windowDuration and slideDuration: windowDuration is the length of each window, and slideDuration is how far the window start advances from one window to the next.
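To make the bucketing concrete, here is a small plain-Python sketch (no Spark required) of which windows a single timestamp falls into. It assumes windows are aligned to the Unix epoch with no start offset, which is my understanding of the default behaviour of window(); the helper name sliding_windows is hypothetical and is not part of any Spark API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(ts, window, slide):
    """Return every [start, end) window that contains ts, earliest first.

    Window starts are assumed to be multiples of `slide` counted from the
    Unix epoch (an assumption about Spark's default alignment).
    """
    # Latest window start that is not after ts.
    start = ts - ((ts - EPOCH) % slide)
    windows = []
    # Walk backwards while the window still covers ts.
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


# Example: a 10-minute window sliding every 5 minutes.
event = datetime(2019, 9, 25, 12, 7, tzinfo=timezone.utc)
for start, end in sliding_windows(event, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "to", end.time())
```

With these settings the 12:07 event lands in two overlapping windows, [12:00, 12:10) and [12:05, 12:15), so it contributes to two aggregate rows. If slideDuration is omitted or equal to windowDuration, the windows do not overlap and each row is counted once.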
If you group by window() [docs], the result has a window column alongside the other grouping columns, such as client_id.
For example:
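from pyspark.sql.functions import window

windowDuration = "10 minutes"
slideDuration = "5 minutes"
# Note: orderBy on a streaming aggregate is only supported in complete output mode.
summarized = before_summary.groupBy(
    before_summary.client_id,
    window(before_summary.date, windowDuration, slideDuration)
).agg(<a lot of aggs>).orderBy('window')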
Source: https://stackoverflow.com/questions/58100079/append-only-new-aggregates-based-on-groupby-keys