Difference between batch interval, sliding interval and window size in spark streaming

后端 未结 1 512
囚心锁ツ
囚心锁ツ 2021-02-06 02:48

I am new spark streaming. I understood window size needs to be a multiple of the batch interval. But how does the sliding interval work? If i have 3 as window size and 2 as slid

相关标签:
1条回答
  • 2021-02-06 03:37

    Here is a link to a documentation.

    Let's walk through these concepts:

    1. batch interval - it is time in seconds how long data will be collected before dispatching processing on it. For example if you set batch interval 5 seconds - Spark Streaming will collect data for 5 seconds and then kick out calculation on RDD with that data.
    2. window size - it is interval of time in seconds for how much historical data shall be contained in RDD before processing. For example you have 1 second batch interval and window size of 2 - in that case you will have calculation kicked out each second for 2 previous batches. E.g at time=3 you will have data from batch at time=2 and time=3.
    3. sliding interval - is amount of time in seconds for how much the window will shift. For example in previous example sliding interval is 1 (since calculation is kicked out each second) e.g. at time=1, time=2, time=3... if you set sliding interval=2, you will get calculation at time=1, time=3, time=5...

    You can refer to image above where window size is 3 times of batch interval and sliding window is 2 times of batch interval.

    To answer a question why window and sliding intervals shall be multiple of batch interval - it is because otherwise your window will end inbetween batch.

    If you have 3 as window size and 2 as sliding interval (see image) - yes, your word count will overlap. Basically you use window when you want to calculate something for some limited time - like actual news or tweets or whatever, when you don't need all historical data for the analysis.

    0 讨论(0)
提交回复
热议问题