What is the difference between mini-batch vs real time streaming in practice (not theory)?

情歌与酒 2021-01-31 10:29

What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame w

3 Answers
  •  星月不相逢
    2021-01-31 11:09

    I know an answer has already been accepted, but I think one more thing needs to be said to answer this question fully. A claim like "Flink's real-time processing is faster/better for streaming" is wrong in my view, because it depends heavily on what you want to do.

    The Spark mini-batch model has, as the previous answer noted, the disadvantage that a new job must be created for each mini-batch.

    However, the default processing-time trigger in Spark Structured Streaming is 0, which means new data is read as fast as possible. In practice:

    1. One query (micro-batch) starts.
    2. New data arrives while the first query is still running.
    3. The first query ends, so the new data is processed immediately.

    Latency is very small in such cases.
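
To make the three steps above concrete, here is a small plain-Python simulation (not Spark code). The function name `simulate_micro_batches` and its parameters `batch_overhead_ms`, `per_record_cost_ms` and `trigger_interval_ms` are my own assumptions for illustration. The scheduling rule is a simplification of a processing-time trigger: a new micro-batch starts once the previous one has finished and the trigger interval has elapsed.

```python
from typing import List


def simulate_micro_batches(
    arrivals_ms: List[int],
    batch_overhead_ms: int,
    per_record_cost_ms: int = 0,
    trigger_interval_ms: int = 0,
) -> List[int]:
    """Return the end-to-end latency (ms) of each record, in arrival order.

    Hypothetical model: every micro-batch costs a fixed overhead plus a
    per-record cost, and picks up all records that arrived before it started.
    """
    arrivals = sorted(arrivals_ms)
    latencies: List[int] = []
    now = 0  # time at which the next trigger fires
    i = 0    # index of the first record not yet processed
    while i < len(arrivals):
        # Collect every record that has arrived by the time the trigger fires.
        j = i
        while j < len(arrivals) and arrivals[j] <= now:
            j += 1
        if j == i:
            # Nothing to read: with interval 0 we pick up the next record as
            # soon as it arrives, otherwise we wait for the next trigger.
            now = now + trigger_interval_ms if trigger_interval_ms > 0 else arrivals[i]
            continue
        start = now
        end = start + batch_overhead_ms + per_record_cost_ms * (j - i)
        latencies.extend(end - t for t in arrivals[i:j])
        i = j
        # The next batch cannot start before this one ends.
        now = max(end, start + trigger_interval_ms)
    return latencies


# Records arrive at 0 ms, 50 ms (while batch 1 is running) and 300 ms.
fast = simulate_micro_batches([0, 50, 300], batch_overhead_ms=100)
slow = simulate_micro_batches([0, 50, 300], batch_overhead_ms=100,
                              trigger_interval_ms=1000)
print(fast)  # [100, 150, 100]
print(slow)  # [100, 1050, 800]
```

In this toy model the record that arrives at 50 ms only waits for the running batch to finish, so with a trigger of 0 its latency stays close to the cost of one batch. With a 1-second trigger the same record waits for the next trigger instead.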

    One big advantage over Flink is that, because of this mini-batch model, Spark has unified APIs for batch and stream processing. You can easily turn a batch job into a streaming job, or join streaming data with older data loaded in batch. Doing that with Flink is not possible. Flink also doesn't let you run interactive queries on the data you've received.
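
The idea behind that unification can be sketched in plain Python (this is an analogy, not Spark's actual API). If a stream is handled as a sequence of small batches, the same transformation can serve both modes. All names here (`enrich`, `run_batch`, `run_micro_batches`, the `users` lookup table and the `(user_id, amount)` record shape) are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (user_id, amount): a made-up record shape


def enrich(events: List[Event], users: Dict[str, str]) -> List[Tuple[str, int]]:
    """Join events with a static lookup table, dropping unknown users."""
    return [(users[uid], amount) for uid, amount in events if uid in users]


def run_batch(events: List[Event], users: Dict[str, str]) -> List[Tuple[str, int]]:
    # Batch mode: apply the transformation once to the whole dataset.
    return enrich(events, users)


def run_micro_batches(
    events: List[Event], users: Dict[str, str], batch_size: int
) -> List[Tuple[str, int]]:
    # Streaming mode: apply the very same transformation to each small batch.
    out: List[Tuple[str, int]] = []
    for start in range(0, len(events), batch_size):
        out.extend(enrich(events[start:start + batch_size], users))
    return out


users = {"u1": "Ann", "u2": "Bob"}  # "old" data loaded in batch
events = [("u1", 10), ("u3", 5), ("u2", 7), ("u1", 1)]

batch_result = run_batch(events, users)
stream_result = run_micro_batches(events, users, batch_size=2)
print(batch_result == stream_result)  # True
```

The sketch only covers a stateless join. Aggregations across batches would need state carried from one micro-batch to the next, which the streaming engine handles for you.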

    As said before, use cases are different for micro-batches and real-time streaming:

    1. For very, very low latencies, Flink or a computational grid such as Apache Ignite is a good choice. They are suited to very low-latency processing, but not to very complex computations.
    2. For medium and higher latencies, Spark offers a more unified API that lets you write more complex computations the same way batch jobs are written, precisely because of this unification.

    For more details about Structured Streaming, please look at this blog post.
