What is the difference between mini-batch vs real time streaming in practice (not theory)?

情歌与酒 2021-01-31 10:29

What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame w

3 answers
  •  陌清茗 (original poster)
    2021-01-31 11:13

    Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.

    The mini-batch stream processing model as implemented by Spark Streaming works as follows:

    • Records of a stream are collected in a buffer (mini-batch).
    • Periodically, the collected records are processed using a regular Spark job. That is, a complete distributed batch processing job is scheduled and executed for each mini-batch.
    • While the job runs, the records for the next batch are collected.
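
To make the three steps above concrete, here is a minimal single-process sketch in plain Python. It is not Spark code and does not use any Spark API: `run_mini_batches`, `batch_interval`, and `job` are hypothetical names chosen for illustration, and the overlap between running one job and collecting the next batch is left out to keep it short.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[float, str]  # (arrival time in seconds, payload)


def run_mini_batches(
    records: Iterable[Record],
    batch_interval: float,
    job: Callable[[List[str]], int],
) -> List[int]:
    """Buffer records per fixed interval and run one job per buffer.

    Assumes records are sorted by arrival time (an assumption of this sketch).
    """
    results: List[int] = []
    buffer: List[str] = []
    batch_end = batch_interval
    for arrival, payload in records:
        # Close every batch whose interval ended before this record arrived.
        # Each closed batch triggers a full "job", even if the buffer is empty.
        while arrival >= batch_end:
            results.append(job(buffer))
            buffer = []
            batch_end += batch_interval
        buffer.append(payload)
    # Flush the last, partially filled buffer.
    results.append(job(buffer))
    return results


stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (3.5, "d")]
counts = run_mini_batches(stream, batch_interval=1.0, job=len)
print(counts)  # [2, 1, 0, 1]
```

Note that the sketch pays for one job per interval, whether or not anything arrived. In a real distributed engine that per-job cost is scheduling work across a cluster, which is what makes very short intervals expensive.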

    So, why is it not effective to run a mini-batch every 1 ms? Simply because this would mean scheduling a distributed batch job every millisecond. Even though Spark is very fast at scheduling jobs, this would be a bit too much. It would also significantly reduce the achievable throughput. Batching techniques used in operating systems or TCP also do not work well if their batches become too small.
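
A rough back-of-the-envelope model of that argument, as a sketch only: assume every job carries a fixed scheduling overhead, and ask what share of each batch interval is left for actual processing. The 5 ms overhead below is a made-up illustrative number, not a measured Spark figure, and `useful_fraction` is a hypothetical helper.

```python
def useful_fraction(batch_interval_ms: float, overhead_ms: float) -> float:
    """Share of each batch interval left for real work after a fixed
    per-job scheduling overhead (0.0 means the system cannot keep up)."""
    return max(0.0, 1.0 - overhead_ms / batch_interval_ms)


ASSUMED_OVERHEAD_MS = 5.0  # illustrative assumption, not a benchmark result

for interval_ms in (1000.0, 100.0, 10.0, 1.0):
    share = useful_fraction(interval_ms, ASSUMED_OVERHEAD_MS)
    print(f"interval={interval_ms:>6.0f} ms  useful share={share:.3f}")
```

Under this assumption, a 1 s interval loses well under 1% to scheduling, a 10 ms interval loses half, and a 1 ms interval has no time left for processing at all. The exact numbers depend entirely on the assumed overhead; only the shape of the trade-off is the point.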
