How does the unbounded table work in Spark Structured Streaming?

Question

Take word count as an example: when the application starts up and receives the word "Spark", the result table contains a row (Spark, 1).

After the application has been running for a day or even a week, it receives "Spark" again, so the result table should now contain the row (Spark, 2).

I am using this scenario to raise the question: how does the unbounded table keep the state of the data it has received, given that the state could become huge after the application has been running for a long time?

Also, when using "Complete" output mode, if the result table is very large, writing all of its data out to the sink will be very time-consuming.

Answer 1:

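To avoid holding this huge amount of data in memory, Spark Structured Streaming uses watermarks. The main idea is to keep state only for data that falls within a specific event-time window. State that falls behind the watermark can be dropped, while the state still being tracked is kept in a state store that is checkpointed to a fault-tolerant file system. Watermarks are described in more detail in the Structured Streaming Programming Guide.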
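As a rough illustration of the idea, here is a toy model in plain Python of how a watermark bounds the state of a windowed word count. It is not Spark code and does not reproduce Spark's exact eviction rules: the class name `WindowedWordCount`, the 10-minute window and the 5-minute lateness threshold are all assumptions made up for this sketch. In Spark itself the equivalent setup would use `withWatermark` on the streaming DataFrame together with a `window` grouping on the event-time column.

```python
from collections import defaultdict


class WindowedWordCount:
    """Toy model of a watermarked, windowed word count (hypothetical, not Spark)."""

    def __init__(self, window=600, lateness=300):
        self.window = window            # tumbling window length in seconds (assumed value)
        self.lateness = lateness        # allowed lateness in seconds (assumed value)
        self.max_event_time = 0         # largest event time seen so far
        self.watermark = 0              # events for windows ending at or before this are ignored
        self.state = defaultdict(int)   # (window_start, word) -> running count

    def process_batch(self, events):
        """Consume (event_time_seconds, word) pairs; return the windows finalized by this batch."""
        for event_time, word in events:
            self.max_event_time = max(self.max_event_time, event_time)
            window_start = event_time - event_time % self.window
            if window_start + self.window <= self.watermark:
                continue  # too late: this window was already finalized and evicted
            self.state[(window_start, word)] += 1

        # Advance the watermark after the batch; it never moves backwards.
        self.watermark = max(self.watermark, self.max_event_time - self.lateness)

        # Evict every window that ends at or before the watermark.
        finalized = {
            key: count
            for key, count in self.state.items()
            if key[0] + self.window <= self.watermark
        }
        for key in finalized:
            del self.state[key]
        return finalized


counter = WindowedWordCount()
counter.process_batch([(10, "Spark"), (20, "Spark")])  # window [0, 600) stays open
closed = counter.process_batch([(1000, "Spark")])      # watermark moves to 700, closing [0, 600)
print(closed)               # {(0, 'Spark'): 2}
print(dict(counter.state))  # {(600, 'Spark'): 1}
```

The state never grows beyond the windows that are still open, which is the point of the watermark. Note that this only helps when the aggregation is keyed by an event-time window: a global count keyed by word alone, as in the question, has no window that can expire, and in Complete output mode Spark has to preserve all aggregate state in order to rewrite the full result table.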

Source: https://stackoverflow.com/questions/47175547/how-does-unbound-table-work-in-spark-structured-streaming

Tags

apache-spark

spark-streaming
