Difference between batch interval, sliding interval and window size in spark streaming

囚心锁ツ 2021-02-06 02:48

I am new to Spark Streaming. I understand that the window size needs to be a multiple of the batch interval. But how does the sliding interval work? If I have 3 as window size and 2 as slid

1 answer

执笔经年 (original poster)

2021-02-06 03:37
Here is a link to the documentation.

Let's walk through these concepts:
1. batch interval - the length of time (in seconds) for which data is collected before processing is dispatched on it. For example, if you set a batch interval of 5 seconds, Spark Streaming collects data for 5 seconds and then kicks off the calculation on an RDD containing that data.
2. window size - the length of time (in seconds) of historical data that is included in the RDD before processing. For example, with a 1 second batch interval and a window size of 2 seconds, a calculation is kicked off every second over the 2 most recent batches. E.g. at time=3 you get the data from the batches at time=2 and time=3.
3. sliding interval - the amount of time (in seconds) by which the window shifts between calculations. In the previous example the sliding interval is 1 (since a calculation is kicked off every second), i.e. at time=1, time=2, time=3... If you set the sliding interval to 2, you get a calculation at time=1, time=3, time=5...
You can refer to the image above, where the window size is 3 times the batch interval and the sliding interval is 2 times the batch interval.
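To make the picture concrete, here is a small plain-Python sketch (not Spark code) that lists which batches end up in each window. It assumes a batch interval of 1, batches numbered from 1, and that the first window is emitted once a full window of data exists, as in the image; the function name `window_batches` is made up for illustration, and Spark's real start-up behavior may differ.

```python
def window_batches(num_batches, window_size, slide):
    """Map each emit time to the batch numbers its window covers.

    Simplified model (an assumption, not Spark's implementation):
    batch interval is 1, and the first window ends at time=window_size.
    """
    schedule = {}
    t = window_size
    while t <= num_batches:
        # The window ending at time t covers the last `window_size` batches.
        schedule[t] = list(range(t - window_size + 1, t + 1))
        t += slide  # the window shifts by the sliding interval
    return schedule


schedule = window_batches(num_batches=7, window_size=3, slide=2)
for emit_time, batches in schedule.items():
    print(f"time={emit_time}: batches {batches}")
```

With window size 3 and sliding interval 2, each window starts on the last batch of the previous one, so that batch is processed twice.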

To answer the question of why the window size and sliding interval must be multiples of the batch interval: otherwise your window would end in the middle of a batch.
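As a rough illustration of that rule, the check below flags settings where a window boundary would fall inside a batch. It is only a sketch: `check_durations` is a hypothetical helper, not part of Spark, and all values are assumed to be whole seconds.

```python
def check_durations(batch_interval, window_size, sliding_interval):
    """Return a list of problems; an empty list means everything lines up."""
    problems = []
    if window_size % batch_interval != 0:
        problems.append(
            f"window_size {window_size} is not a multiple of "
            f"batch_interval {batch_interval}"
        )
    if sliding_interval % batch_interval != 0:
        problems.append(
            f"sliding_interval {sliding_interval} is not a multiple of "
            f"batch_interval {batch_interval}"
        )
    return problems


# 15 s window and 10 s slide over 5 s batches: boundaries fall on batch edges.
ok = check_durations(batch_interval=5, window_size=15, sliding_interval=10)
# 12 s window over 5 s batches: the window would cover 2.4 batches.
bad = check_durations(batch_interval=5, window_size=12, sliding_interval=10)
print(ok)
print(bad)
```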

If you have 3 as the window size and 2 as the sliding interval (see image) - yes, your word counts will overlap: consecutive windows share one batch interval's worth of data, so the words in that batch are counted in both windows. Basically, you use a window when you want to calculate something over a limited period of time - like current news or tweets - when you don't need all the historical data for the analysis.
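The overlap is easy to see with a toy word count in plain Python (again a sketch, not Spark code). The sample words and the `windowed_counts` helper are invented for illustration, with a batch interval of 1 and the first window ending at time=3.

```python
from collections import Counter

# Hypothetical input: the words that arrived in each 1-second batch.
batches = {
    1: ["spark", "kafka"],
    2: ["spark"],
    3: ["flink", "spark"],
    4: ["kafka"],
    5: ["spark", "flink"],
}


def windowed_counts(batches, window_size, slide):
    """Count words over each window; keys are the window end times."""
    counts = {}
    t = window_size
    while t <= max(batches):
        window_counter = Counter()
        for b in range(t - window_size + 1, t + 1):
            window_counter.update(batches.get(b, []))
        counts[t] = window_counter
        t += slide
    return counts


counts = windowed_counts(batches, window_size=3, slide=2)
for end_time, counter in counts.items():
    print(f"time={end_time}: {dict(counter)}")
```

Batch 3 belongs to both the window ending at time=3 and the one ending at time=5, so its "flink" and "spark" contribute to both results.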