I cannot understand the difference between multi-threading and partitioning in Spring batch. The implementation is of course different: In partitioning you need to prepare the p
TL;DR;
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains by having multiple items going through a processor at the same time, but both of the options you point out get their full benefits when used in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.
Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:
Multithreaded step
Parallel steps
Partitioning
Remote chunking
AsyncItemProcessor/AsyncItemWriter
Each has its own benefits and disadvantages. Let's walk through each:
Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc) are shared across the threads. This can increase performance by adding some parallelism to the step at the cost of restartability in most cases. You sacrifice restartability because in most cases, the ability to restart is based on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on individual components and set the restartable flag to false on the job.
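To make the shared-state point concrete, here is a minimal plain-JDK sketch, not Spring Batch code. The names MultithreadedStepSketch, SharedReader, and run are hypothetical, and the thread pool merely stands in for the step's task executor. One reader instance is shared by all threads, so its read method has to be synchronized.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MultithreadedStepSketch {

    /** One reader instance shared by every worker thread, so reads must be thread-safe. */
    static final class SharedReader {
        private final Iterator<Integer> source;

        SharedReader(List<Integer> items) {
            this.source = items.iterator();
        }

        /** Returns up to chunkSize items, or an empty list once the input is exhausted. */
        synchronized List<Integer> readChunk(int chunkSize) {
            List<Integer> chunk = new ArrayList<>();
            while (chunk.size() < chunkSize && source.hasNext()) {
                chunk.add(source.next());
            }
            return chunk;
        }
    }

    /** Runs the "step" on several threads and returns the items in the order they were written. */
    static List<Integer> run(List<Integer> items, int threads, int chunkSize) {
        SharedReader reader = new SharedReader(items);
        Queue<Integer> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                List<Integer> chunk;
                while (!(chunk = reader.readChunk(chunkSize)).isEmpty()) {
                    written.addAll(chunk); // stand-in for process + write
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            items.add(i);
        }
        // Every item is written exactly once, but the order may differ between runs.
        System.out.println(run(items, 4, 5));
    }
}
```

The reader's cursor only records how many items have been handed out, not which chunks have actually finished, which is why a saved position is not a trustworthy restart point in this setup.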
Parallel steps
Parallel steps are achieved via a split. It allows you to execute multiple, independent steps in parallel via threads. This does not sacrifice restartability, but does not help improve the performance of a single step or piece of business logic.
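As a rough illustration of what a split does, the following plain-JDK sketch runs two independent steps at the same time and starts a third only after both have finished. It is not Spring Batch configuration; SplitSketch and the step names are made up for the example.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

class SplitSketch {

    /** Runs two independent "steps" in parallel, then a third once both are done. */
    static List<String> runJob() {
        List<String> log = new CopyOnWriteArrayList<>();

        // The two flows share no data, so each can run on its own thread.
        CompletableFuture<Void> flow1 = CompletableFuture.runAsync(() -> log.add("loadCustomers"));
        CompletableFuture<Void> flow2 = CompletableFuture.runAsync(() -> log.add("loadOrders"));

        // The job continues only after both flows have completed.
        CompletableFuture.allOf(flow1, flow2).join();
        log.add("buildReport");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runJob());
    }
}
```

Each step still runs on a single thread here, which matches the point above: the job as a whole finishes sooner, but no individual step gets faster.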
Partitioning
Partitioning is the dividing of data, in advance, into smaller chunks (called partitions) by a master step and then having slaves work independently on the partitions. In Spring Batch, the master and each slave are independent steps, so you can get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM in that the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).
An important note about partitioning is that the only communication between the master and slave is a description of the data and not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, etc. The master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes and the master can be located anywhere.
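The "records 1-100, records 101-200" example can be sketched as a small range calculation. This is a plain-JDK illustration with hypothetical names (RangePartitionerSketch, partition); in Spring Batch the equivalent logic would live in a Partitioner implementation, whose partition(int gridSize) method returns a Map<String, ExecutionContext>. Here a long[] pair stands in for the execution context.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RangePartitionerSketch {

    /** Splits the ids minId..maxId into at most gridSize contiguous ranges, keyed by partition name. */
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            // Only the boundaries are stored; the records themselves are never loaded here.
            partitions.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Prints:
        // partition0 -> 1..100
        // partition1 -> 101..200
        // partition2 -> 201..300
        partition(1, 300, 3).forEach(
                (name, range) -> System.out.println(name + " -> " + range[0] + ".." + range[1]));
    }
}
```

Each slave would then use its own pair of boundaries to query the data store directly, which is why the data has to be reachable from the slave rather than from the master.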
Remote chunking
Remote chunking allows you to scale the processing and, optionally, the write logic across JVMs. In this use case, the master reads the data and then sends it over the wire to the slaves, where it is processed. The processed items are then either written by the slave itself or returned to the master, which writes them locally.
The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying process records 1-100, remote chunking is going to send the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, this can be useful.
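For contrast with partitioning, here is a minimal in-process sketch of the remote chunking data flow. A thread pool stands in for the remote JVMs and the messaging layer, and all names (RemoteChunkingSketch, run, process) are hypothetical. The point to notice is that the payload handed to each worker is the list of items itself, not a description of where to find them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RemoteChunkingSketch {

    /** Master side: reads all input, ships each chunk of real items to a worker, writes the replies. */
    static List<String> run(List<String> input, int chunkSize, int workers) {
        ExecutorService workerPool = Executors.newFixedThreadPool(workers); // stand-in for remote JVMs
        try {
            List<Future<List<String>>> replies = new ArrayList<>();
            for (int i = 0; i < input.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, input.size());
                // The chunk is copied, much as it would be when serialized onto the wire.
                List<String> chunk = new ArrayList<>(input.subList(i, end));
                replies.add(workerPool.submit(() -> process(chunk)));
            }
            List<String> written = new ArrayList<>();
            for (Future<List<String>> reply : replies) {
                written.addAll(reply.get()); // processed items come back to the master for writing
            }
            return written;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            workerPool.shutdown();
        }
    }

    /** Worker side: the only work done remotely is the processing of the received items. */
    static List<String> process(List<String> chunk) {
        List<String> result = new ArrayList<>();
        for (String item : chunk) {
            result.add(item.toUpperCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c", "d", "e"), 2, 2)); // prints [A, B, C, D, E]
    }
}
```

Because every item crosses the wire at least once, this only pays off when processing an item costs noticeably more than transferring it.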
AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsyncItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation in a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter, where it is unwrapped and passed to the delegate ItemWriter implementation.
Because of the nature of how data flows through this option, certain listener scenarios are not supported (since we don't know the outcome of the ItemProcessor call until inside the ItemWriter) but overall, it can provide a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.
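The Future hand-off described above can be sketched with plain JDK types. This is an illustration of the pattern only, not the Spring Batch classes themselves; AsyncProcessingSketch, AsyncProcessor, AsyncWriter, and runChunk are hypothetical names, and a Function and a Consumer stand in for the delegate ItemProcessor and ItemWriter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

class AsyncProcessingSketch {

    /** Wraps a delegate processor: process() returns at once with a Future of the result. */
    static final class AsyncProcessor<I, O> {
        private final Function<I, O> delegate;
        private final ExecutorService executor;

        AsyncProcessor(Function<I, O> delegate, ExecutorService executor) {
            this.delegate = delegate;
            this.executor = executor;
        }

        Future<O> process(I item) {
            return executor.submit(() -> delegate.apply(item));
        }
    }

    /** Wraps a delegate writer: unwraps each Future, then passes the plain items on. */
    static final class AsyncWriter<O> {
        private final Consumer<List<O>> delegate;

        AsyncWriter(Consumer<List<O>> delegate) {
            this.delegate = delegate;
        }

        void write(List<Future<O>> chunk) {
            List<O> unwrapped = new ArrayList<>();
            for (Future<O> future : chunk) {
                try {
                    unwrapped.add(future.get()); // a processing failure first surfaces here
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            }
            delegate.accept(unwrapped);
        }
    }

    /** Pushes one chunk through the async processor and writer and returns what was written. */
    static List<Integer> runChunk(List<String> items) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Integer> sink = new ArrayList<>();
            AsyncProcessor<String, Integer> processor = new AsyncProcessor<>(String::length, executor);
            AsyncWriter<Integer> writer = new AsyncWriter<>(sink::addAll);

            List<Future<Integer>> chunk = new ArrayList<>();
            for (String item : items) {
                chunk.add(processor.process(item)); // all items of the chunk are processed concurrently
            }
            writer.write(chunk);
            return sink;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runChunk(List.of("a", "bb", "ccc"))); // prints [1, 2, 3]
    }
}
```

Reading and writing stay on the step's own thread, so the reader's state remains usable for restart. The comment on future.get() marks the reason for the listener limitation: the outcome of processing is not known until the writer unwraps the Future.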