Optimal MySQL settings for queries that deliver large amounts of data?

北海茫月 2020-11-27 07:05

I work as a scientist and I have used MySQL as storage for the results of my numerical simulations. Typically I have a set of data obtained by my experiment and a control

1 Answer
  • 2020-11-27 07:09

    Something has to be seriously wrong for your query to be taking 2 hours to execute when I can do the same thing in under 60 seconds on similar hardware.

    Some of the following might prove helpful...

    Tune MySQL for your engine

    Check your server configuration and optimise accordingly. Some of the following resources should be useful; a quick check of the key buffer / buffer pool settings follows the list.

    • http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/
    • http://www.mysqlperformanceblog.com/
    • http://www.highperfmysql.com/
    • http://forge.mysql.com/wiki/ServerVariables
    • http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html
    • http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
    • http://jpipes.com/presentations/perf_tuning_best_practices.pdf
    • http://jpipes.com/presentations/index_coding_optimization.pdf
    • http://www.jasny.net/?p=36
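
    As a first pass, it's worth confirming that the index/data caches are actually sized for your box, since the defaults are tiny. A minimal sanity check (the values below are purely illustrative, not recommendations for your hardware):

    show variables like 'key_buffer_size';          -- caches myisam index blocks
    show variables like 'innodb_buffer_pool_size';  -- caches innodb data and index pages
    
    -- key_buffer_size can be raised at runtime; the innodb buffer pool is sized in
    -- my.cnf and needs a server restart on this MySQL version.
    set global key_buffer_size = 1024 * 1024 * 1024; -- e.g. 1GB on a myisam-heavy box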

    Now for the less obvious...

    Consider using a stored procedure to process the data server side

    Why not process all the data inside of MySQL so you don't have to send vast quantities of data to your application layer? The following example uses a cursor to loop over and process 50M rows server side in under 2 minutes. I'm not a huge fan of cursors, especially in MySQL where they are very limited, but I'm guessing you'd be looping over the resultset and doing some form of numerical analysis, so use of a cursor is justifiable in this case.

    Simplified myisam results table - keys based on yours.

    drop table if exists results_1mregr_c_ew_f;
    create table results_1mregr_c_ew_f
    (
    id int unsigned not null auto_increment primary key,
    rc tinyint unsigned not null,
    df int unsigned not null default 0,
    val double(10,4) not null default 0,
    ts timestamp not null default now(),
    key (rc, df)
    )
    engine=myisam;
    

    I generated 100M rows of data with the key fields having approximately the same cardinality as in your example:

    show indexes from results_1mregr_c_ew_f;
    
    Table                   Non_unique  Key_name    Seq_in_index    Column_name Collation   Cardinality Index_type
    =====                   ==========  ========    ============    =========== =========   =========== ==========
    results_1mregr_c_ew_f       0       PRIMARY         1               id          A       100000000   BTREE   
    results_1mregr_c_ew_f       1       rc              1               rc          A               2   BTREE   
    results_1mregr_c_ew_f       1       rc              2               df          A             223   BTREE   
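
    For reference, one way to bulk-generate test rows of roughly this shape is to cross join a small helper table against itself. This is only a sketch - the digits table and the value ranges below are illustrative, not the exact generator used:

    drop table if exists digits;
    create table digits (d tinyint unsigned not null primary key);
    insert into digits values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
    
    -- 10^7 rows per statement; repeat (or add another digits join) to reach 100M
    insert into results_1mregr_c_ew_f (rc, df, val)
    select floor(rand() * 2),         -- rc : ~2 distinct values
           floor(rand() * 223) + 1,   -- df : ~223 distinct values
           rand() * 100               -- val
    from digits d1, digits d2, digits d3, digits d4, digits d5, digits d6, digits d7;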
    

    Stored procedure

    I created a simple stored procedure that fetches the required data and processes it (it uses the same where condition as your example).

    drop procedure if exists process_results_1mregr_c_ew_f;
    
    delimiter #
    
    create procedure process_results_1mregr_c_ew_f
    (
    in p_rc tinyint unsigned,
    in p_df int unsigned
    )
    begin
    
    declare v_count int unsigned default 0;
    declare v_done tinyint default 0;
    declare v_id int unsigned;
    declare v_result_cur cursor for select id from results_1mregr_c_ew_f where rc = p_rc and df > p_df;
    declare continue handler for not found set v_done = 1;
    
    open v_result_cur;
    
    repeat
        fetch v_result_cur into v_id;
    
        -- guard so the final failed fetch isn't counted or processed
        if not v_done then
            set v_count = v_count + 1;
            -- do work...
        end if;
    
    until v_done end repeat;
    close v_result_cur;
    
    select v_count as counter;
    
    end #
    
    delimiter ; 
    

    The following runtimes were observed:

    call process_results_1mregr_c_ew_f(0,60);
    
    runtime 1 = 03:24.999 Query OK (3 mins 25 secs)
    runtime 2 = 03:32.196 Query OK (3 mins 32 secs)
    
    call process_results_1mregr_c_ew_f(1,60);
    
    runtime 1 = 04:59.861 Query OK (4 mins 59 secs)
    runtime 2 = 04:41.814 Query OK (4 mins 41 secs)
    
    counter
    ========
    23000002 (23 million rows processed in each case)
    

    Hmmmm, performance is a bit disappointing, so on to the next idea.

    Consider using the innodb engine (shock horror)

    Why innodb? Because it has clustered indexes! Inserts will be slower with innodb, but hopefully reads will be faster, so it's a trade-off that might be worth it.

    Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. For example, MyISAM uses one file for data rows and another for index records.

    More info here :

    • http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

    Simplified innodb results table

    drop table if exists results_innodb;
    create table results_innodb
    (
    rc tinyint unsigned not null,
    df int unsigned not null default 0,
    id int unsigned not null, -- cant auto_inc this !!
    val double(10,4) not null default 0,
    ts timestamp not null default now(),
    primary key (rc, df, id) -- note clustered (innodb only !) composite PK
    )
    engine=innodb;
    

    One problem with innodb is that it doesn't support an auto_increment field as the trailing column of a composite primary key like this, so you'd have to provide the incrementing key value yourself using a sequence generator, a trigger or some other method - perhaps in the application that populates the result table itself (a trigger-based sketch follows).
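
    For example, here is a minimal sketch of a trigger-based approach (the results_innodb_seq helper table and trigger name are made up for illustration; the single-row update also serialises concurrent inserts):

    drop table if exists results_innodb_seq;
    create table results_innodb_seq (next_id int unsigned not null) engine=innodb;
    insert into results_innodb_seq values (0);
    
    delimiter #
    
    create trigger results_innodb_before_ins before insert on results_innodb
    for each row
    begin
        -- grab the next value from the single row sequence table
        update results_innodb_seq set next_id = last_insert_id(next_id + 1);
        set new.id = last_insert_id();
    end #
    
    delimiter ;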

    Again, I generated 100M rows of data with the key fields having approximately the same cardinality as in your example. Don't worry if these figures don't match the myisam example - innodb estimates the cardinalities, so they won't be exactly the same (it is the same dataset underneath, though).

    show indexes from results_innodb;
    
    Table           Non_unique  Key_name    Seq_in_index    Column_name Collation   Cardinality Index_type
    =====           ==========  ========    ============    =========== =========   =========== ==========
    results_innodb      0       PRIMARY         1               rc          A                18     BTREE   
    results_innodb      0       PRIMARY         2               df          A                18     BTREE   
    results_innodb      0       PRIMARY         3               id          A         100000294     BTREE   
    

    Stored procedure

    The stored procedure is exactly the same as the myisam example above but selects data from the innodb table instead.

    declare v_result_cur cursor for select id from results_innodb where rc = p_rc and df > p_df;
    

    The results are as follows:

    call process_results_innodb(0,60);
    
    runtime 1 = 01:53.407 Query OK (1 mins 53 secs)
    runtime 2 = 01:52.088 Query OK (1 mins 52 secs)
    
    call process_results_innodb(1,60);
    
    runtime 1 = 02:01.201 Query OK (2 mins 01 secs)
    runtime 2 = 01:49.737 Query OK (1 mins 50 secs)
    
    counter
    ========
    23000002 (23 million rows processed in each case)
    

    Approximately 1.5 to 3 minutes faster than the myisam engine implementation! (innodb FTW)

    Divide and Conquer

    Processing the results in a server-side stored procedure that uses a cursor might not be an optimal solution, especially as MySQL doesn't have support for things such as arrays and complex data structures that are readily available in 3GL languages such as C#, or even in other databases such as Oracle with PL/SQL.

    So the idea here is to return batches of data to an application layer (C# or whatever), which can then add the results to a collection-based data structure and process the data internally.

    Stored procedure

    The stored procedure takes three parameters - rc, df_low and df_high - which allow you to select a range of data as follows:

    call list_results_innodb(0,1,1); -- df 1
    call list_results_innodb(0,1,10); -- df between 1 and 10
    call list_results_innodb(0,60,120); -- df between 60 and 120 etc...
    

    Obviously, the wider the df range, the more data you'll be extracting.

    drop procedure if exists list_results_innodb;
    
    delimiter #
    
    create procedure list_results_innodb
    (
    in p_rc tinyint unsigned,
    in p_df_low int unsigned,
    in p_df_high int unsigned
    )
    begin
        select rc, df, id from results_innodb where rc = p_rc and df between p_df_low and p_df_high;
    end #
    
    delimiter ; 
    

    I also knocked up a myisam version which is identical except for the table that is used:

    call list_results_1mregr_c_ew_f(0,1,1);
    call list_results_1mregr_c_ew_f(0,1,10);
    call list_results_1mregr_c_ew_f(0,60,120);
    

    Based on the cursor example above I would expect the innodb version to out-perform the myisam one.

    I developed a quick and dirty multi-threaded C# application that calls the stored procedure and adds the results to a collection for post-query processing. You don't have to use threads; the same batched query approach could be done sequentially without much loss of performance.

    Each thread (QueryThread) selects a range of df data, loops the resultset and adds each result (row) to the results collection.

    using System;
    using System.Collections.Generic;
    using System.Threading;
    
    class Program
    {
        static void Main(string[] args)
        {
            const int MAX_THREADS = 12;
            const int MAX_RC = 120;
    
            List<AutoResetEvent> signals = new List<AutoResetEvent>();
            ResultDictionary results = new ResultDictionary(); // thread safe collection
    
            DateTime startTime = DateTime.Now;
            int step = (int)Math.Ceiling((double)MAX_RC / MAX_THREADS) - 1;
    
            // carve the full df range into MAX_THREADS contiguous slices
            int start = 1, end = 0;
            for (int i = 0; i < MAX_THREADS; i++){
                end = (i == MAX_THREADS - 1) ? MAX_RC : end + step;
                signals.Add(new AutoResetEvent(false));
    
                // QueryThread (not shown) starts its own worker thread, runs the ranged
                // stored procedure call and signals its event when its slice is done
                QueryThread st = new QueryThread(i, signals[i], results, 0, start, end);
                start = end + 1;
            }
            WaitHandle.WaitAll(signals.ToArray());
            TimeSpan runTime = DateTime.Now - startTime;
    
            Console.WriteLine("{0} results fetched and looped in {1} secs\nPress any key", results.Count, runTime.ToString());
            Console.ReadKey();
        }
    }
    

    Runtime observed as follows:

    Thread 04 done - 31580517
    Thread 06 done - 44313475
    Thread 07 done - 45776055
    Thread 03 done - 46292196
    Thread 00 done - 47008566
    Thread 10 done - 47910554
    Thread 02 done - 48194632
    Thread 09 done - 48201782
    Thread 05 done - 48253744
    Thread 08 done - 48332639
    Thread 01 done - 48496235
    Thread 11 done - 50000000
    50000000 results fetched and looped in 00:00:55.5731786 secs
    Press any key
    

    So 50 million rows fetched and added to a collection in under 60 seconds.

    I tried the same thing using the myisam stored procedure which took 2 minutes to complete.

    50000000 results fetched and looped in 00:01:59.2144880 secs
    

    Moving to innodb

    In my simplified system the myisam table doesn't perform too badly, so it might not be worth migrating to innodb. If you do decide to copy your result data into an innodb table, then do it as follows:

    start transaction;
    
    insert into results_innodb 
     select <fields...> from results_1mregr_c_ew_f order by <innodb primary key>;
    
    commit;
    

    Ordering the results by the innodb PK before inserting, and wrapping the whole thing in a transaction, will speed things up: the rows arrive in clustered index order, so pages fill sequentially instead of being split and reorganised mid-load.

    I hope some of this proves helpful.

    Good luck
