What is the difference between mini-batch and real-time streaming in practice (not in theory)? In theory, I understand that mini-batch batches the data arriving within a given time frame, whereas real-time streaming does something as soon as the data arrives. My biggest question is: why not have a mini-batch with an epsilon time frame (say, one millisecond)? I would like to understand why one would be a more effective solution than the other.
I recently came across one example where mini-batch (Apache Spark) is used for fraud detection and real-time streaming (Apache Flink) is used for fraud prevention. Someone also commented that mini-batches would not be an effective solution for fraud prevention (since the goal is to prevent the transaction as it happens). Now I wonder why this wouldn't be effective with mini-batch (Spark). Why is it not effective to run mini-batch with 1 millisecond latency? Batching is a technique used everywhere, including the OS and the kernel's TCP/IP stack, where data written to disk or the network is indeed buffered, so what is the convincing factor here to say one is more effective than the other?
Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.
The mini-batch stream processing model as implemented by Spark Streaming works as follows:
- Records of a stream are collected in a buffer (mini-batch).
- Periodically, the collected records are processed using a regular Spark job. This means that for each mini-batch, a complete distributed batch processing job is scheduled and executed.
- While the job runs, the records for the next batch are collected.
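The three steps above can be sketched as a toy, single-threaded simulation (plain Python, not Spark internals; the event timestamps and the `micro_batch_engine` name are illustrative, only the cut-a-batch-per-interval logic is the point):

```python
def micro_batch_engine(events, batch_interval, process):
    """events: list of (timestamp, record) pairs, in time order.
    Records are collected in a buffer; every interval, a whole 'job'
    (here just a function call) runs over the collected mini-batch."""
    buffer, outputs, next_cut = [], [], batch_interval
    for ts, record in events:
        while ts >= next_cut:                # interval elapsed: run a job
            outputs.append(process(buffer))  # one complete job per mini-batch
            buffer, next_cut = [], next_cut + batch_interval
        buffer.append(record)                # collect records for the next batch
    if buffer:
        outputs.append(process(buffer))      # flush the final batch
    return outputs

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.5, "d")]
print(micro_batch_engine(events, 1.0, list))  # prints [['a', 'b'], ['c'], ['d']]
```

Note that a record arriving at t=0.1 is not processed until the cut at t=1.0: with this model, latency is bounded below by the batch interval plus the job's own runtime.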
So, why is it not effective to run a mini-batch every 1 ms? Simply because it would mean scheduling a distributed batch job every millisecond. Even though Spark is very fast at scheduling jobs, that would be a bit too much. It would also significantly reduce the achievable throughput. Batching techniques used in operating systems or in TCP likewise stop working well when the batches become too small.
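To see why, here is a back-of-the-envelope sketch. It assumes a hypothetical fixed per-job scheduling cost (the 10 ms figure is made up for illustration, not measured from Spark) and computes how much of each batch interval is left for actual processing:

```python
def effective_throughput(batch_interval_s, schedule_overhead_s,
                         records_per_s=100_000):
    """Throughput after paying a fixed scheduling cost once per batch.
    All numbers are illustrative, not Spark measurements."""
    if batch_interval_s <= schedule_overhead_s:
        return 0.0  # the whole interval is eaten by scheduling the job
    useful_fraction = (batch_interval_s - schedule_overhead_s) / batch_interval_s
    return useful_fraction * records_per_s

# With a (hypothetical) 10 ms scheduling cost per job:
print(effective_throughput(1.0, 0.010))    # 1 s batches: ~99% of capacity
print(effective_throughput(0.050, 0.010))  # 50 ms batches: ~80% of capacity
print(effective_throughput(0.001, 0.010))  # 1 ms batches: 0 — overhead dominates
```

The fixed per-batch cost is amortized over fewer and fewer records as the interval shrinks, until at 1 ms it exceeds the interval itself.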
I know that one answer was accepted, but I think more must be said to answer this question fully. I think an answer like "Flink's real-time streaming is faster/better" is wrong, because it heavily depends on what you want to do.
Spark's mini-batch model has, as the previous answer noted, the disadvantage that a new job must be created for each mini-batch.
However, Spark Structured Streaming's default processing-time trigger is set to 0, which means new data is read as fast as possible. Concretely:
- a query starts
- data arrives, but the first query hasn't finished yet
- the first query ends, so the new data is processed immediately
Latency is very small in such cases.
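That trigger behavior can be sketched as a simple loop (plain Python, not Spark code; `run_query` and the toy source are illustrative names): as soon as one micro-batch finishes, the next starts with whatever data arrived in the meantime, so there is no fixed interval to wait out.

```python
def run_query(process, poll_source, stop_after_batches=3):
    """Run micro-batches back to back: as soon as one finishes, the next
    one starts immediately over whatever has arrived since the last poll."""
    results = []
    for _ in range(stop_after_batches):
        batch = poll_source()          # grab everything that arrived so far
        if batch:
            results.append(process(batch))
        # no sleep between iterations: latency is bounded by the
        # previous batch's own processing time, not by a timer
    return results

# Toy source: each poll returns what "arrived" since the previous poll.
arrivals = [[1, 2], [3], [4, 5, 6]]
print(run_query(sum, lambda: arrivals.pop(0) if arrivals else []))  # prints [3, 3, 15]
```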
One big advantage Spark has over Flink is its unified API for batch and stream processing, which follows from this mini-batch model. You can easily translate a batch job into a streaming job, or join streaming data with old data from a batch. Doing that with Flink is not possible. Flink also doesn't let you run interactive queries on the data you've received.
As said before, the use cases for micro-batches and real-time streaming are different:
- For very, very small latencies, Flink or a computational grid like Apache Ignite will be a good fit. They are suitable for processing with very low latency, but not for very complex computations.
- For medium and larger latencies, Spark's more unified API allows more complex computations to be written in the same way as batch jobs, precisely because of that unification.
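The unified-API point can be illustrated with a plain-Python sketch (not PySpark; the `transform` function and the 0.9 conversion rate are made-up examples): the transformation is defined once and applied unchanged to a finite batch or to an unbounded stream of micro-batches.

```python
from typing import Iterable, Iterator

def transform(records: Iterable[dict]) -> list[dict]:
    """One business-logic definition, shared by the batch and streaming paths.
    The 0.9 USD->EUR rate is a made-up illustrative constant."""
    return [{**r, "amount_eur": r["amount_usd"] * 0.9} for r in records]

def run_batch(table: list[dict]) -> list[dict]:
    return transform(table)                      # whole dataset at once

def run_stream(micro_batches: Iterator[list[dict]]) -> Iterator[list[dict]]:
    for batch in micro_batches:                  # same logic per micro-batch
        yield transform(batch)

batch_out = run_batch([{"amount_usd": 10.0}])
stream_out = list(run_stream(iter([[{"amount_usd": 10.0}]])))
print(batch_out == stream_out[0])  # prints True
```

Because both paths call the same `transform`, moving a job from batch to streaming is a change of driver, not of business logic; that is the convenience the unified API buys you.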
For more details about Structured Streaming please look at this blog post
This is something I think about a lot, because formulating the answer for technical and non-technical people alike is always hard.
I will try to answer this part:
"Why is it not effective to run mini-batch with 1 millisecond latency?"
I believe the problem is not the model itself but how Spark implements it. There is empirical evidence that performance degrades when the mini-batch window is reduced too much. In fact, a window of at least 0.5 seconds was suggested to prevent this kind of degradation, and on big volumes even that window size was too small. I never had the chance to test it in production, but then I never had a strong real-time requirement either.
I know Flink better than Spark, so I don't really know Spark's internals that well, but I believe the overhead introduced by scheduling each batch job is irrelevant if the batch takes at least a few seconds to process; it becomes heavy, however, once it imposes a fixed latency you cannot go below. To understand the nature of these overheads, you would have to dig into the Spark documentation, code, and open issues.
The industry now acknowledges that a different model is needed, and that's why many "streaming-first" engines are growing, with Flink as the front runner. I don't think it's just buzzwords and hype, but the use cases for this kind of technology are, at least for now, quite limited: if you need to take an automated decision in real time on big, complex data, you need a real-time fast-data engine. In any other case, including near-real-time, real-time streaming is overkill and mini-batch is fine.
Source: https://stackoverflow.com/questions/39715803/what-is-the-difference-between-mini-batch-vs-real-time-streaming-in-practice-no