My problem statement. Read a csv file with 10 million data and store it in db. with as minimal time as possible.
I had implemented it using Simple multi threaded
You can split your input file to many file , the use Partitionner and load small files with threads, but on error , you must restart all job after DB cleaned.
<batch:job id="transformJob">
<batch:step id="deleteDir" next="cleanDB">
<batch:tasklet ref="fileDeletingTasklet" />
</batch:step>
<batch:step id="cleanDB" next="split">
<batch:tasklet ref="countThreadTasklet" />
</batch:step>
<batch:step id="split" next="partitionerMasterImporter">
<batch:tasklet>
<batch:chunk reader="largeCSVReader" writer="smallCSVWriter" commit-interval="#{jobExecutionContext['chunk.count']}" />
</batch:tasklet>
</batch:step>
<batch:step id="partitionerMasterImporter" next="partitionerMasterExporter">
<partition step="importChunked" partitioner="filePartitioner">
<handler grid-size="10" task-executor="taskExecutor" />
</partition>
</batch:step>
</batch:job>
Full example code (on Github).
Hope this help.
Let me known if this help or how you solve problem because I'm interested for my (future) work!
I hope my indications can be helpful.
Here is how I solved the problem.
Read a file and chunk the file( split the file) using Buffered and File Channel reader and writer ( the fastest way of File read/write, even spring batch uses the same). I implemented such that this is executed before job is started( However it can be executed using job as step using method invoker)
Start the Job with directory location as job parameter.