Multithreaded database reading

星月不相逢 2021-02-04 22:16

In our Java application I have a requirement to read around 80 million records from an Oracle database. I am trying to redesign the multithreaded program for this. Currently we

3 answers
  •  深忆病人
    2021-02-04 22:49

    Network

    First of all, since using rowid and rownum ties you to one vendor anyway, you should consider using stored procedures in the database. That could significantly reduce the overhead of transmitting data from the database to the application server (especially if they are on different machines connected over a network).

    Considering that you have 80 million records to transmit, that could be the biggest performance boost for you, though it depends on the kind of work your threads do.

    Obviously increasing bandwidth would also help to solve networking issues.

    Disk performance

    Before making changes in the code, check the hard drive load while the tasks are running; perhaps it just can't handle that much I/O (10 threads reading simultaneously).

    Migrating to SSD/RAID or clustering the database might solve the issue, whereas changing the way you access the database won't help in that case.

    Multithreading could solve CPU problems, but database performance mostly depends on the disk subsystem.

    Rownum

    There are a couple of problems you might face if you implement it using rowid and rownum.

    1) rownum is generated on the fly for each query's results. So if the query doesn't have explicit sorting, a record may get a different rownum every time you run the query.

    For example, you run it the first time and get results like this:

    some_column | rownum
    ____________|________
         A      |    1
         B      |    2
         C      |    3
    

    Then you run it a second time and, since you don't have explicit sorting, the DBMS (for some reason known only to itself) decides to return results like this:

    some_column | rownum
    ____________|________
         C      |    1
         A      |    2
         B      |    3
    

    2) Point 1 also implies that if you filter results on rownum, the database may have to build the full result set (effectively a temporary table with ALL results) and only then filter it.

    So rownum is not a good choice for splitting results. Rowid seems better, but it has some issues too.
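
    To make both points concrete, here is a minimal sketch of the commonly used nested rownum paging query, built as a Java string. The table name `records` and key column `id` used with it are hypothetical, not from the question. The inner ORDER BY is what makes the numbering repeatable (assuming the sort column is unique), and the database still has to produce every row up to the upper bound before the outer filter can discard the ones below the lower bound.

```java
// Sketch only: table and column names are supplied by the caller and are hypothetical here.
final class RownumPage {
    /** Builds a nested Oracle ROWNUM paging query returning rows numbered (lo, hi]. */
    static String pageQuery(String table, String orderColumn) {
        return "SELECT * FROM ("
             + " SELECT t.*, ROWNUM rn FROM ("
             + "  SELECT * FROM " + table + " ORDER BY " + orderColumn // explicit sort: stable numbering
             + " ) t WHERE ROWNUM <= ?"                                // first bind: upper bound (hi)
             + ") WHERE rn > ?";                                       // second bind: lower bound (lo)
    }
}
```

    Each page further into the table repeats the work of all the pages before it, which is why this pattern scales poorly as a way of splitting 80 million rows.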

    Rowid

    If you look at the ROWID description you may notice that "rowid value uniquely identifies a row in the database".

    Because of that, and because deleting a row leaves a "hole" in the rowid sequence, rowids may be distributed unevenly among table records.

    So, for example, if you have three threads each fetching a range of 1,000,000 rowids, it is possible that one will get 1,000,000 records and the other two 1 record each. One thread will be overwhelmed while the other two starve.

    It might not be a big deal in your case, though it very well might be the problem you are currently facing with the primary key pattern.

    Or, if you first fetch all rowids in a dispatcher and then divide them equally (as peter.petrov suggested), that could do the trick, though fetching 80 million ids still sounds like a lot. I think it would be better to do the splitting with one SQL query that returns the borders of the chunks.
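
    One possible shape for such a border query is sketched below. It is an illustration under assumptions, not a tested recipe: it assumes a numeric key column (here called `id`) on a table called `records`, both hypothetical names, and it uses the NTILE analytic function to cut the table into equally sized buckets and return each bucket's lowest and highest key.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch only: table and key column names are hypothetical and passed in by the caller.
final class ChunkBorders {
    /** SQL that assigns rows to `chunks` equally sized buckets and returns each bucket's key range. */
    static String bordersQuery(String table, String keyColumn, int chunks) {
        if (chunks < 1) throw new IllegalArgumentException("chunks must be >= 1");
        return "SELECT bucket, MIN(" + keyColumn + ") AS lo, MAX(" + keyColumn + ") AS hi"
             + " FROM (SELECT " + keyColumn + ", NTILE(" + chunks + ") OVER (ORDER BY " + keyColumn + ") AS bucket"
             + " FROM " + table + ")"
             + " GROUP BY bucket ORDER BY bucket";
    }

    /** Runs the query once in the dispatcher; each long[]{lo, hi} is one task's inclusive key range. */
    static List<long[]> fetchBorders(Connection conn, String table, String keyColumn, int chunks)
            throws SQLException {
        List<long[]> borders = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(bordersQuery(table, keyColumn, chunks))) {
            while (rs.next()) {
                borders.add(new long[] { rs.getLong("lo"), rs.getLong("hi") });
            }
        }
        return borders;
    }
}
```

    The dispatcher then receives only one row per chunk instead of 80 million ids, and each worker reads `WHERE id BETWEEN lo AND hi`. The query itself still scans and sorts the key, so it is worth checking its plan before relying on it.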

    Or you might solve that problem by giving each task a small number of rowids and using the Fork-Join framework introduced in Java 7; however, it should be used carefully.
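
    A minimal sketch of that idea follows. The class name `RangeTask` and the `leaf` callback are hypothetical: the callback stands in for "read and process the keys in [lo, hi)" and returns the number of records handled. Ranges are halved until they are no larger than a threshold, so idle workers can steal small pieces from busy ones, which evens out sparse and dense ranges.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.LongBinaryOperator;

// Sketch only: `leaf` is a placeholder for the real per-chunk database read.
final class RangeTask extends RecursiveTask<Long> {
    private final long lo, hi, threshold;
    private final LongBinaryOperator leaf;

    RangeTask(long lo, long hi, long threshold, LongBinaryOperator leaf) {
        if (threshold < 1) throw new IllegalArgumentException("threshold must be >= 1");
        this.lo = lo;
        this.hi = hi;
        this.threshold = threshold;
        this.leaf = leaf;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= threshold) {
            return leaf.applyAsLong(lo, hi);       // small enough: process [lo, hi) directly
        }
        long mid = lo + (hi - lo) / 2;
        RangeTask left = new RangeTask(lo, mid, threshold, leaf);
        RangeTask right = new RangeTask(mid, hi, threshold, leaf);
        left.fork();                               // queue the left half so another worker can steal it
        return right.compute() + left.join();      // work on the right half in this thread
    }

    /** Processes [lo, hi) on a dedicated pool and returns the summed leaf results. */
    static long run(long lo, long hi, long threshold, LongBinaryOperator leaf, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new RangeTask(lo, hi, threshold, leaf));
        } finally {
            pool.shutdown();
        }
    }
}
```

    For example, `RangeTask.run(0, 1000, 10, (a, b) -> b - a, 4)` just counts the keys in the range. One reason for the "carefully": Fork-Join workers are designed for CPU-bound work, so a leaf that blocks on JDBC ties up a worker for the whole read; a dedicated pool sized to the number of database connections, as in `run`, is one way to keep that under control.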

    Also, an obvious point: neither rownum nor rowid is portable across databases.

    So it is much better to have your own "sharding" column, but then you will have to make sure yourself that it splits the records into more or less equal chunks.
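
    As a small illustration of checking that balance, the sketch below assumes one hypothetical choice of sharding value, the numeric key modulo the number of shards, and counts how many ids fall into each shard. The class and method names are made up for this example.

```java
import java.util.stream.LongStream;

// Sketch only: "shard = id mod N" is an assumed sharding rule, not something from the question.
final class Sharding {
    /** Maps an id to a shard in [0, shards); floorMod keeps negative ids in range too. */
    static int shardOf(long id, int shards) {
        return (int) Math.floorMod(id, (long) shards);
    }

    /** Counts how many of the given ids land in each shard, to see whether the split is roughly even. */
    static long[] histogram(LongStream ids, int shards) {
        long[] counts = new long[shards];
        ids.forEach(id -> counts[shardOf(id, shards)]++);
        return counts;
    }
}
```

    On the real table the same check would be a `GROUP BY` on the sharding column with a `COUNT(*)` per shard; if the counts differ widely, the worker threads will finish at very different times.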


    Also keep in mind that if you are going to do it in several threads, it is important to check what locking mode the database uses; if it locks the whole table on every access, multithreading is pointless.

    As others suggested, you'd better first find the main reason for the low performance (network, disk, database locking, thread starvation, or maybe you just have suboptimal queries; check the query plans).
