Question
I have to process some files that arrive daily. The information has a primary key of (date, client_id, operation_id), so I created a stream that appends only new data into a Delta table:
operations\
.repartition('date')\
.writeStream\
.outputMode('append')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/operations')
This is working fine, but I need to summarize this information grouped by (date, client_id), so I created another stream that reads from this operations table into a new table. I tried converting my date field to a timestamp so I could use append mode while writing the aggregated stream:
import pyspark.sql.functions as F
summarized = spark.readStream.format('delta').load('/mnt/sandbox/operations')
summarized = summarized.withColumn('timestamp_date', F.to_timestamp('date'))
summarized = summarized.withWatermark('timestamp_date', '1 second').groupBy('client_id', 'date', 'timestamp_date').agg(<lot of aggs>)
summarized\
.repartition('date')\
.writeStream\
.outputMode('append')\
.option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
.trigger(once=True)\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/summarized')
This code runs, but it does not write anything to the sink.
Why isn't it writing results into the sink?
Answer 1:
There could be two issues at play here.
Malformed Date Input
I'm quite sure that the issue is with F.to_timestamp('date'), which gives null due to malformed input. If so, withWatermark('timestamp_date', '1 second') can never be "materialized" and triggers no output.
Could you spark.read.format('delta').load('/mnt/sandbox/operations') (i.e. read, not readStream) and see if the conversion gives proper values?
spark \
    .read \
    .format('delta') \
    .load('/mnt/sandbox/operations') \
    .withColumn('timestamp_date', F.to_timestamp('date')) \
    .show()
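As a follow-up check (a minimal sketch, reusing the Delta path and column names from the question), you could also count how many date values fail the conversion and come back null:
import pyspark.sql.functions as F

# Sketch: count rows whose 'date' value fails to parse into a timestamp.
# A non-zero count under failed_conversion = true would confirm malformed input.
ops = spark.read.format('delta').load('/mnt/sandbox/operations')
ops.withColumn('timestamp_date', F.to_timestamp('date')) \
   .groupBy(F.col('timestamp_date').isNull().alias('failed_conversion')) \
   .count() \
   .show()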
All Rows Use Same Timestamp
It is also possible that withWatermark('timestamp_date', '1 second') never finishes (and so never "completes" an aggregation) because all rows carry the same timestamp, so event time does not advance.
You should have rows with different timestamps so that the notion of time tracked by timestamp_date can get past the '1 second' lateness window.
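As a quick way to verify this (a sketch under the same assumptions about the Delta path and the date column), you could inspect how spread out the event times actually are; with only one distinct value, the watermark, which trails the maximum event time seen by 1 second, can never move past it:
import pyspark.sql.functions as F

# Sketch: inspect the spread of event times. With a single distinct
# timestamp_date the watermark never advances past it, so append mode
# never finalizes (or writes) the aggregated groups.
spark.read.format('delta').load('/mnt/sandbox/operations') \
    .withColumn('timestamp_date', F.to_timestamp('date')) \
    .agg(F.countDistinct('timestamp_date').alias('distinct_event_times'),
         F.min('timestamp_date').alias('min_ts'),
         F.max('timestamp_date').alias('max_ts')) \
    .show()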
Source: https://stackoverflow.com/questions/58135188/streaming-aggregate-not-writing-into-sink