PostgreSQL how split a query between multiple CPU

后端 未结 2 396
南旧
南旧 2021-01-22 04:15

I have a store procedure

DO_STUFF(obj rowFromMyTable) 

This take obj and process some data and save the result in an independent table. So

相关标签:
2条回答
  • 2021-01-22 04:44

    Two ways to do this (works in any of Windows / Linux / Mac):

    • PostgreSQL 9.6+ should now be able to (automatically) parallelize your queries to some extent and then you may want to see whether you need to take the pain to split the queries yourself at all.

    • Use dblink and connect to the database via multiple callbacks. The best part about DBLink is that these can be fire-n-forget (i.e. asynchronous) calls and so can be called in quick succession and then eventually wait till they all complete (although you'd need to weave the wait-for-result logic yourself). However, the drawback (as is with synchronous calls) is that unless you keep track of things like process failures / timeouts etc. you may wrongly assume that since the calls went through (successfully) all data was processed, where actually its possible that some calls failed (asynchronously).

    Example

    SELECT * FROM dblink_send_query('testconn', 'SELECT do_stuff_wrapper(0, 5000)') AS t1;
    SELECT dblink_is_busy('testconn');
    SELECT * FROM dblink_get_result('testconn') AS t1(c1 TEXT, c2 TEXT, ....);
    

    Update: Exemplify using dblink's asynchronous functions.

    0 讨论(0)
  • 2021-01-22 04:50

    A technique I like to use to get quick multi-threading for queries is to use a combination of psql and GNU Parallel (http://www.gnu.org/software/parallel/parallel_tutorial.html) to allow for multiple psql commands to be run at once.

    If you create a wrapper stored procedure containing the loop and add arguments to it to take an offset and a limit, you can then create a quick bash script (or Python, Perl, et al) to generate the series of psql commands that are needed.

    The file containing the commands can be piped into parallel and either take all the CPUs available, or a number you determine (I often like to use 4 CPUs, so as to also keep a lid on I/O on the box, but it would depend on the hardware you have).

    Let's say the wrapper is called do_stuff_wrapper(_offset, _limit). The offset and limit would apply to the select:

    select obj from tblSOURCE offset _offset limit _limit
    

    Your generated psql command file (let's call it parallel.dat) might look something like this:

    psql -X -h HOST -U user database -c "select do_stuff_wrapper(0, 5000);"
    psql -X -h HOST -U user database -c "select do_stuff_wrapper(5001, 5000);"
    psql -X -h HOST -U user database -c "select do_stuff_wrapper(10001, 5000);"
    

    and so on.

    Then you can run the commands like this:

    cat parallel.dat | parallel -j 4 {}

    To get multiple psql commands running in concert. Parallel will also pipeline the IO (if any, such as NOTICE's, etc.) for you such that it ends up in command order.

    Edit: If you're running on Windows, you could perhaps install Cygwin, and then use parallel from there. Another, pure-Windows option would be to look into Powershell to accomplish something akin to parallel (see Can Powershell Run Commands in Parallel?).

    0 讨论(0)
提交回复
热议问题