In our Java application I have requirement to read the around 80 million records from oracle database. I am trying to redesign the multithreading program for this. Currently we
As noted in the comments, multithreading might not help and may even make matters worse.
Standard alternatives for any query are:
SELECT COUNT...
there is nothing more you can do here!None of these are guaranteed to succeed. It would depend on what you are trying to do.
Create a dispatching thread which reads the PKs (the IDs) of N rows. Here you can do some sort of caching - read N=1000 rows, give them to Worker1, read next N=1000 rows, give them to Worker2, etc. This way, you don't need to keep more than N=1000 IDs (PKs) in-memory in the dispatcher thread. Once you pass the work (the work is these N=1000 IDs) to the Worker thread, you dispose them in the dispatcher thread (no need to keep them).
Each worker thread takes its N (e.g. 1000) PKs/IDs and using them gets the rows from the DBs. Make sure here that you use rowlock (T-SQL) or its equivalent if you're not using SQL Server. This way, the threads will not get into each other's way. So worker reads N rows from the DB and processes them. Once complete it may signal the dispatcher (something like "I am done" event).
This is the initial idea that comes to my mind. I guess it could be refined further if you think some more on it.
First of all, since using rowid and rownum is vendor-lock anyway, you should consider using database stored routines. It could significantly reduce overhead of transmitting data from database to the application server (especially if they are on different machines and connected through network).
Considering that you have 80 million records to transmit, that could be the best performance boost for you, though it depends on kind of work your threads do.
Obviously increasing bandwidth would also help to solve networking issues.
Before making changes in code check the hard drive load while tasks running, perhaps it just can't handle that much I/O (10 threads reading simultaneously).
Migrating to SSD/RAID or clustering database might solve the issue. While changing the way you access database won't in that case.
Multithreading could solve CPU problems, but databases mostly depend on disk system.
There are a couple of problems you might face if you will be implementing it using rowid and rownum.
1) rownum is generated on the fly for each query's results. So if query doesn't have explicit sorting and it is possible that some record have different rownum every time you run query.
For example you run it first time and get results like this:
some_column | rownum
____________|________
A | 1
B | 2
C | 3
then you run it second time, since you don't have explicit sorting, dbms (for some reason known to itself) decides to return results like this:
some_column | rownum
____________|________
C | 1
A | 2
B | 3
2) point 1 also implies that if you will be filtering results on rownum it will generate temporary table with ALL results and then filter it
So rownum is not a good choice for splitting results. While rowid seemed better, it has some issues too.
If you look at the ROWID description you may notice that "rowid value uniquely identifies a row in the database".
Because of that and the fact that when you delete a row you have a "hole" in rowid sequence, rowids may be distributed not equally among table records.
So for example if you have three threads and each fetching 1'000'000 rowids, it is possible that one will get 1'000'000 records and other two 1 record each. So one will be overwhelmed, while two other starving.
It might be not a big deal in your case, though it very well might be the problem you are facing currently with primary key pattern.
Or if you first fetch all rowids in dispatcher and then divide them equally (like peter.petrov suggested) that could do the thing, though fetching 80 million ids still sounds like a lot, I think it would be better to do the splitting with one sql-query that returns borders of chunks.
Or you might solve that problem by giving low amount of rowids per task and using Fork-Join framework introduced in Java 7, however it should be used carefully.
Also obvious point: both rownum and rowid are not portable across databases.
So it is much better to have your own "sharding" column but then you will have to make sure yourself that it splits records in more or less equal chunks.
Also keep in mind that if you are going to do it in several threads it is important to check what locking mode database uses, perhaps it just locks the table for every access, then multithreading is pointless.
As others suggested, you'd better first find what is the main reason of low performance (network, disk, database locking, thread starvation or maybe you just have suboptimal queries - check the query plans).