Question
Take word count as an example. Shortly after the long-running application starts, it receives the word "Spark", so the result table contains the row (Spark,1). After the application has been running for a day or even a week, it receives "Spark" again, so the result table should now contain the row (Spark,2).
I am using this scenario only to raise the question: how does the unbounded table keep the state of the data it has received, given that the state could become huge after the application has been running for a long time?
Also, when using "Complete" output mode, if the result table is very large, writing all of its rows to the sink will be very time-consuming.
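To make the scenario concrete, here is a toy, pure-Python model of it. It is not Spark code and not how Spark is implemented; the class name, method names, and the extra sample words "Kafka" and "Hive" are made up for illustration. It only shows the two quantities the question is about: the state kept between batches (one counter per distinct word, not the raw input) and how many rows each output style would write.

```python
from collections import Counter


class RunningWordCount:
    """Toy model of a streaming word count; not Spark code."""

    def __init__(self):
        # One counter per distinct word; the raw input rows are never stored.
        self.state = Counter()

    def process_batch(self, words):
        """Fold a micro-batch into the state and return the rows it changed."""
        changed = {}
        for word in words:
            self.state[word] += 1
            changed[word] = self.state[word]
        return changed  # roughly what an "Update"-style sink would receive

    def complete_output(self):
        """Return every row of the result table, as "Complete" mode would write."""
        return dict(self.state)


wc = RunningWordCount()
day_one = wc.process_batch(["Spark", "Kafka", "Hive"])
week_later = wc.process_batch(["Spark"])  # only the (Spark, 2) row changes
```

In this model the second batch changes a single row, while `complete_output()` still returns all three rows, which is the cost the question is worried about when the table is large.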
Answer 1:
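To keep the state from growing without bound, Spark Structured Streaming uses watermarks. The main idea is to keep state only for data that falls within a specific event-time window. Once a window is older than the watermark, its state is dropped rather than kept around, and records that arrive later for that window are ignored. This applies to event-time aggregations in Append or Update output mode; Complete mode has to preserve all aggregated state, so a watermark cannot shrink it. The state that is kept lives in a state store that is checkpointed to fault-tolerant storage. You can read about watermarks here or here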
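As a rough illustration of the watermark idea, the following is a toy pure-Python model, not Spark's implementation. In Spark itself the watermark is declared with `withWatermark` on a streaming DataFrame before an event-time aggregation; everything below (class name, parameters, integer timestamps in seconds, the exact eviction rule) is an assumption made for the sketch.

```python
from collections import defaultdict


class WatermarkedWindowCount:
    """Toy model of windowed counts with watermark-based state cleanup."""

    def __init__(self, window_seconds, delay_seconds):
        self.window_seconds = window_seconds
        self.delay_seconds = delay_seconds
        self.max_event_time = 0
        self.state = defaultdict(int)  # (window_start, word) -> count

    def process_batch(self, events):
        """events: list of (event_time_seconds, word). Returns changed rows."""
        # The watermark trails the largest event time seen in earlier batches.
        watermark = self.max_event_time - self.delay_seconds

        # Drop state for windows that ended at or before the watermark.
        expired = [k for k in self.state
                   if k[0] + self.window_seconds <= watermark]
        for key in expired:
            del self.state[key]

        changed = {}
        for event_time, word in events:
            window_start = event_time - event_time % self.window_seconds
            if window_start + self.window_seconds <= watermark:
                continue  # too late: this window's state is already gone
            key = (window_start, word)
            self.state[key] += 1
            changed[key] = self.state[key]
            self.max_event_time = max(self.max_event_time, event_time)
        return changed


agg = WatermarkedWindowCount(window_seconds=60, delay_seconds=120)
b1 = agg.process_batch([(10, "Spark")])   # counted in window starting at 0
b2 = agg.process_batch([(400, "Flink")])  # advances the max event time to 400
b3 = agg.process_batch([(20, "Spark")])   # window 0 is now behind the watermark
```

After the third batch only the recent window remains in `agg.state`, so the memory used depends on the window size and the allowed delay rather than on how long the application has been running. The trade-off is that a "Spark" arriving a week later would be counted in a new window instead of incrementing the old (Spark,1) row.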
Source: https://stackoverflow.com/questions/47175547/how-does-unbound-table-work-in-spark-structured-streaming