MySQL and NoSQL: Help me to choose the right one

予麋鹿 2020-11-22 03:54

There is a big database, 1,000,000,000 rows, called threads (these threads actually exist, I'm not making things harder just because I enjoy it). Threads has only a few

5 answers
  •  死守一世寂寞
    2020-11-22 04:20

    You should read the following to learn a little about the advantages of a well-designed InnoDB table and how best to use clustered indexes (only available with InnoDB).

    http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

    http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

    Then design your system along the lines of the following simplified example:

    Example schema (simplified)

    The important features are that the tables use the InnoDB engine and that the primary key for the threads table is no longer a single auto_increment key but a composite clustered key based on a combination of forum_id and thread_id. For example:

    threads - primary key (forum_id, thread_id)
    
    forum_id    thread_id
    ========    =========
    1                   1
    1                   2
    1                   3
    1                 ...
    1             2058300  
    2                   1
    2                   2
    2                   3
    2                  ...
    2              2352141
    ...
    

    Each forum row includes a counter called next_thread_id (unsigned int), which is maintained by a trigger and incremented every time a thread is added to that forum. Because thread_id restarts in each forum, we can store about 4.29 billion threads (2^32 - 1, the unsigned int maximum) per forum, rather than about 4.29 billion threads in total as with a single auto_increment primary key on thread_id.
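To make the capacity claim concrete, here is a minimal sketch of the arithmetic. It assumes thread_id is a 4-byte MySQL INT UNSIGNED (as in the schema further down) and uses the 250 forums from the example data; the function names are made up for illustration.

```python
# Assumption: thread_id / next_thread_id are MySQL INT UNSIGNED (4 bytes).
UNSIGNED_INT_MAX = 2**32 - 1  # 4,294,967,295


def capacity_single_key() -> int:
    """Max threads when one auto_increment thread_id is shared by all forums."""
    return UNSIGNED_INT_MAX


def capacity_composite_key(forum_count: int) -> int:
    """Max threads when every forum has its own thread_id sequence."""
    return forum_count * UNSIGNED_INT_MAX


print(f"single key:    {capacity_single_key():,}")
print(f"composite key: {capacity_composite_key(250):,}")
```

With 250 forums the theoretical ceiling rises from roughly 4.29 billion to roughly 1.07 trillion threads, although any single forum is still capped at the unsigned int maximum.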

    forum_id    title   next_thread_id
    ========    =====   ==============
    1          forum 1        2058300
    2          forum 2        2352141
    3          forum 3        2482805
    4          forum 4        3740957
    ...
    64        forum 64       3243097
    65        forum 65      15000000 -- ooh a big one
    66        forum 66       5038900
    67        forum 67       4449764
    ...
    247      forum 247            0 -- still loading data for half the forums !
    248      forum 248            0
    249      forum 249            0
    250      forum 250            0
    

    The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:

    select * from threads where thread_id = y;
    

    you have to do:

    select * from threads where forum_id = x and thread_id = y;
    

    However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...
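The lookup difference can be sketched in a few lines. This is not MySQL: Python's built-in sqlite3 module is used as a stand-in so the example runs anywhere, with a WITHOUT ROWID table to approximate a clustered composite primary key. The helper names and sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here purely so the sketch is runnable;
# WITHOUT ROWID stores rows in primary key order, loosely like InnoDB.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table threads (
        forum_id  integer not null,
        thread_id integer not null,
        hash      text    not null,
        primary key (forum_id, thread_id)
    ) without rowid
    """
)
conn.executemany(
    "insert into threads (forum_id, thread_id, hash) values (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (2, 2, "d")],
)


def get_thread(forum_id: int, thread_id: int):
    """Fetch one thread by its full composite key."""
    return conn.execute(
        "select forum_id, thread_id, hash from threads "
        "where forum_id = ? and thread_id = ?",
        (forum_id, thread_id),
    ).fetchone()


def get_by_thread_id_only(thread_id: int):
    """thread_id alone is ambiguous: every forum restarts its sequence at 1."""
    return conn.execute(
        "select forum_id, thread_id, hash from threads "
        "where thread_id = ? order by forum_id",
        (thread_id,),
    ).fetchall()


print(get_thread(2, 1))
print(get_by_thread_id_only(1))
```

The second helper returns one row per forum, which is why the application has to carry the forum_id alongside the thread_id.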

    Here's the simplified schema:

    drop table if exists forums;
    create table forums
    (
    forum_id smallint unsigned not null auto_increment primary key,
    title varchar(255) unique not null,
    next_thread_id int unsigned not null default 0 -- count of threads in each forum
    )engine=innodb;
    
    
    drop table if exists threads;
    create table threads
    (
    forum_id smallint unsigned not null,
    thread_id int unsigned not null default 0,
    reply_count int unsigned not null default 0,
    hash char(32) not null,
    created_date datetime not null,
    primary key (forum_id, thread_id, reply_count) -- composite clustered index
    )engine=innodb;
    
    delimiter #
    
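    create trigger threads_before_ins_trig before insert on threads
    for each row
    begin
      declare v_id int unsigned default 0;

      -- lock the forum row so concurrent inserts cannot read the same counter value
      select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id for update;
      -- give the new thread the next id in this forum's sequence
      set new.thread_id = v_id;
      -- persist the incremented counter
      update forums set next_thread_id = v_id where forum_id = new.forum_id;
    end#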
    
    delimiter ;
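For readers who want to experiment without a MySQL server, the trigger's bookkeeping can be approximated in application code. The sketch below is an assumption-laden stand-in, not the trigger itself: it uses Python's sqlite3 module, a trimmed-down schema, and a hypothetical add_thread helper that bumps the forum counter and inserts the thread in one transaction.

```python
import sqlite3

# SQLite stand-in for the forums/threads tables (columns trimmed for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table forums (
        forum_id       integer primary key,
        title          text not null unique,
        next_thread_id integer not null default 0
    );
    create table threads (
        forum_id  integer not null,
        thread_id integer not null,
        primary key (forum_id, thread_id)
    ) without rowid;
    insert into forums (forum_id, title) values (1, 'forum 1'), (2, 'forum 2');
    """
)


def add_thread(forum_id: int) -> int:
    """Allocate the next per-forum thread_id and insert the thread atomically."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "update forums set next_thread_id = next_thread_id + 1 "
            "where forum_id = ?",
            (forum_id,),
        )
        (thread_id,) = conn.execute(
            "select next_thread_id from forums where forum_id = ?", (forum_id,)
        ).fetchone()
        conn.execute(
            "insert into threads (forum_id, thread_id) values (?, ?)",
            (forum_id, thread_id),
        )
    return thread_id


ids = [add_thread(1), add_thread(1), add_thread(2), add_thread(1)]
print(ids)
```

Each forum gets its own sequence, so the ids for forum 1 and forum 2 advance independently.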
    

    You may have noticed that I've included reply_count in the primary key, which is a bit strange because the (forum_id, thread_id) composite is unique in itself. This is just an index optimisation that saves some I/O when queries that use reply_count are executed. Please refer to the two links above for more on this.

    Example queries

    I'm still loading data into my example tables and so far I have loaded approximately 500 million rows (half as many as your system). When the load process is complete I expect to have approximately:

    250 forums * 5 million threads = 1,250,000,000 (1.25 billion rows)
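The same estimate as a quick check, using only the figures quoted in the text (250 forums, 5 million threads each on average, roughly 500 million rows loaded so far); the variable names are illustrative.

```python
FORUM_COUNT = 250
AVG_THREADS_PER_FORUM = 5_000_000
LOADED_SO_FAR = 500_000_000  # "approximately 500 million rows" at time of writing

expected_total = FORUM_COUNT * AVG_THREADS_PER_FORUM
fraction_loaded = LOADED_SO_FAR / expected_total

print(f"expected total: {expected_total:,}")
print(f"loaded so far:  {fraction_loaded:.0%}")
```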
    

    I've deliberately made some of the forums contain more than 5 million threads; for example, forum 65 has 15 million threads:

    forum_id    title   next_thread_id
    ========    =====   ==============
    65        forum 65      15000000 -- ooh a big one
    

    Query runtimes

    select sum(next_thread_id) from forums;
    
    sum(next_thread_id)
    ===================
    539,155,433 (500 million threads so far and still growing...)
    

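    Under InnoDB, summing the next_thread_id counters to get a total thread count is much faster than the usual query below, because InnoDB keeps no exact row count and has to scan an index to answer count(*):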

    select count(*) from threads;
    

    How many threads does forum 65 have?

    select next_thread_id from forums where forum_id = 65;
    
    next_thread_id
    ==============
    15,000,000 (15 million)
    

    Again, this is faster than the usual:

    select count(*) from threads where forum_id = 65;
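The counter shortcut only works if the counters and the table agree, which is easy to check on a toy data set. The sketch below again uses sqlite3 as a stand-in for MySQL, with made-up forum sizes and a hypothetical scalar helper. It compares the counter-based answers with the count(*) answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table forums (
        forum_id       integer primary key,
        next_thread_id integer not null default 0
    );
    create table threads (
        forum_id  integer not null,
        thread_id integer not null,
        primary key (forum_id, thread_id)
    ) without rowid;
    """
)

# Load a few threads per forum, bumping the counter alongside each insert.
sizes = {1: 3, 2: 5, 65: 15}  # toy sizes, not the real row counts
with conn:
    for forum_id, n in sizes.items():
        conn.execute("insert into forums (forum_id) values (?)", (forum_id,))
        for thread_id in range(1, n + 1):
            conn.execute("insert into threads values (?, ?)", (forum_id, thread_id))
            conn.execute(
                "update forums set next_thread_id = ? where forum_id = ?",
                (thread_id, forum_id),
            )


def scalar(sql: str, *params):
    """Run a query that returns a single value."""
    return conn.execute(sql, params).fetchone()[0]


total_from_counters = scalar("select sum(next_thread_id) from forums")
total_from_scan = scalar("select count(*) from threads")
forum_65_from_counter = scalar(
    "select next_thread_id from forums where forum_id = ?", 65
)
forum_65_from_scan = scalar("select count(*) from threads where forum_id = ?", 65)

print(total_from_counters, total_from_scan)
print(forum_65_from_counter, forum_65_from_scan)
```

Note that the two answers only match while threads are never deleted: next_thread_id is really "highest id handed out", so a schema that allows deletes would need a separate thread count column.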
    

    OK, now that we know we have about 500 million threads so far and that forum 65 has 15 million of them, let's see how the schema performs :)

    select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;
    
    runtime = 0.022 secs
    
    select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;
    
    runtime = 0.027 secs
    

    Looks pretty performant to me: that's a single table with 500+ million rows (and growing), with a query that covers 15 million rows in 0.02 seconds (while under load!).
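If the `limit 10000, 100` form is unfamiliar: MySQL reads it as "skip 10,000 rows, then return the next 100". A rough model in Python, assuming for simplicity that every thread in forum 65 passes the reply_count filter and that thread ids run from 1 to 15,000,000 without gaps (both are assumptions, not facts about the data set):

```python
def mysql_limit(rows, offset: int, row_count: int):
    """Model of MySQL's `limit offset, row_count` on an already ordered sequence."""
    return rows[offset : offset + row_count]


# Forum 65's thread ids in `order by thread_id desc` order.
# A range is used instead of a list so nothing large is materialised.
thread_ids_desc = range(15_000_000, 0, -1)

page = mysql_limit(thread_ids_desc, 10_000, 100)
print(page[0], page[-1], len(page))
```

Under those assumptions the page starts 10,000 rows below the newest thread and contains 100 consecutive ids.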

    Further optimisations

    These would include:

    • partitioning by range

    • sharding

    • throwing money and hardware at it

    etc...
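As a rough idea of what the first two options mean for routing, here is a sketch that maps a forum_id to a range partition and to a shard. Everything in it is hypothetical: the partition boundaries, the choice of forum_id as the partitioning column, the shard count and the modulo scheme are illustrative assumptions, not part of the schema above.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds, in the spirit of MySQL's
# `partition by range ... values less than (N)`.
PARTITION_UPPER_BOUNDS = [64, 128, 192, 251]


def partition_for(forum_id: int) -> int:
    """Return the index of the range partition that would hold this forum_id."""
    idx = bisect_right(PARTITION_UPPER_BOUNDS, forum_id)
    if forum_id < 1 or idx >= len(PARTITION_UPPER_BOUNDS):
        raise ValueError(f"forum_id {forum_id} is outside every partition range")
    return idx


def shard_for(forum_id: int, shard_count: int = 4) -> int:
    """Pick a shard by simple modulo, so all threads of a forum stay together."""
    return forum_id % shard_count


print(partition_for(1), partition_for(64), partition_for(250))
print(shard_for(65))
```

Keying both schemes on forum_id fits the composite primary key, since the example queries always filter on forum_id first.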

    Hope you find this answer helpful :)
