What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame w
Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.
The mini-batch stream processing model as implemented by Spark Streaming works as follows:
So, why is it not effective to run a mini-batch every 1ms? Simply because this would mean to schedule a distributed batch job every millisecond. Even though Spark is very fast in scheduling jobs, this would be a bit too much. It would also significantly reduce the possible throughput. Batching techniques used in OSs or TCP do also not work well if their batches become too small.