MySQL and NoSQL: Help me to choose the right one

予麋鹿 2020-11-22 03:54

There is a big database, 1,000,000,000 rows, called threads (these threads actually exist, I'm not making things harder just because I enjoy it). Threads has only a few

5 answers
  •  旧巷少年郎
    2020-11-22 04:24

    EDIT: Your one-column indices are not enough. You would need to, at least, cover the three involved columns.

    More advanced solution: replace replycount > 1 with hasreplies = 1 by creating a new hasreplies field that equals 1 when replycount > 1. Once this is done, create an index on the three columns, in that order: INDEX(forumid, hasreplies, dateline). Make sure it's a BTREE index to support ordering.
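A minimal sketch of that schema change, assuming a cut-down threads table with only the columns discussed here (the threadid column and the index name are made up for illustration). It runs against SQLite from Python's standard library purely so it can be tried without a MySQL server; the column and index definitions would need translating to your actual MySQL schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption: the
# real threads table has more columns than the four shown here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE threads ("
    " threadid INTEGER PRIMARY KEY,"
    " forumid INTEGER NOT NULL,"
    " replycount INTEGER NOT NULL,"
    " dateline INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO threads (forumid, replycount, dateline) VALUES (?, ?, ?)",
    [(1, 0, 100), (1, 1, 110), (1, 5, 120), (2, 3, 130)],
)

# New flag column: 1 when replycount > 1, otherwise 0.
conn.execute("ALTER TABLE threads ADD COLUMN hasreplies INTEGER NOT NULL DEFAULT 0")
conn.execute("UPDATE threads SET hasreplies = (replycount > 1)")

# Composite index in the column order given above. In MySQL the index type
# could additionally be stated with USING BTREE.
conn.execute(
    "CREATE INDEX idx_forum_hasreplies_dateline"
    " ON threads (forumid, hasreplies, dateline)"
)

flags = conn.execute(
    "SELECT replycount, hasreplies FROM threads ORDER BY threadid"
).fetchall()
print(flags)
```

Whatever code updates replycount would also have to keep hasreplies in step with it, otherwise the flag goes stale.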

    You're selecting based on:

    • a given forumid
    • a given hasreplies
    • ordered by dateline

    Once you do this, your query execution will involve:

    • moving down the BTREE to find the subtree that matches forumid = X. This is a logarithmic operation (duration: log(number of forums)).
    • moving further down the BTREE to find the subtree that matches hasreplies = 1 (while still matching forumid = X). This is a constant-time operation, because hasreplies is only 0 or 1.
    • moving through the dateline-sorted subtree in order to get the required results, without having to read and re-sort the entire list of items in the forum.
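The three steps above can be imitated in plain Python with a sorted list standing in for the BTREE. This is only a toy model (the entries, the threadid values and the first_threads helper are hypothetical, and ascending dateline order is an assumption), but it shows why no sort is needed once the prefix (forumid, hasreplies) is fixed.

```python
from bisect import bisect_left

# Hypothetical index entries: (forumid, hasreplies, dateline, threadid),
# kept sorted the way a BTREE on (forumid, hasreplies, dateline) orders them.
index_entries = sorted([
    (1, 0, 105, 11),
    (1, 1, 100, 12),
    (1, 1, 120, 13),
    (1, 1, 140, 14),
    (2, 1, 90, 15),
    (2, 1, 150, 16),
])


def first_threads(entries, forumid, limit):
    """Return up to `limit` thread ids with replies in `forumid`, oldest first."""
    # Binary search to the first entry with this forumid and hasreplies = 1.
    start = bisect_left(entries, (forumid, 1))
    result = []
    # From here on the entries are already in dateline order, so read them
    # until the prefix stops matching or the limit is reached.
    for fid, has, _dateline, threadid in entries[start:]:
        if (fid, has) != (forumid, 1) or len(result) == limit:
            break
        result.append(threadid)
    return result


print(first_threads(index_entries, 1, 2))
```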

    My earlier suggestion to index on replycount was incorrect: replycount > 1 is a range condition, so the index could not then have been used to sort by dateline. You would have selected the threads with replies very fast, but the resulting million-line list would have had to be sorted completely before picking out the 100 elements you needed.

    IMPORTANT: while this improves performance in all cases, your huge OFFSET value (10000!) is going to decrease performance, because MySQL does not seem to be able to skip ahead despite reading straight through a BTREE. So, the larger your OFFSET is, the slower the request will become.

    I'm afraid the OFFSET problem is not automagically solved by spreading the computation over several computers (how do you skip an offset in parallel, anyway?) or by moving to NoSQL. All solutions (including NoSQL ones) boil down to simulating OFFSET based on dateline: basically, saying dateline > Y LIMIT 100 instead of LIMIT Z, 100, where Y is the dateline of the item at offset Z. This works, and eliminates any performance issues related to the offset, but it prevents jumping directly to page 100 out of 200.
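Here is a small sketch of the two paging styles side by side, again using SQLite from Python as a stand-in for MySQL. The table contents, PAGE_SIZE and the two helper functions are invented for the example, and it assumes datelines are unique within a forum; with ties you would need an extra tiebreaker column (for instance the thread id) in both the index and the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE threads ("
    " threadid INTEGER PRIMARY KEY,"
    " forumid INTEGER NOT NULL,"
    " hasreplies INTEGER NOT NULL,"
    " dateline INTEGER NOT NULL)"
)
conn.execute(
    "CREATE INDEX idx_forum_hasreplies_dateline"
    " ON threads (forumid, hasreplies, dateline)"
)
# Ten threads in forum 1, all with replies, datelines 1000..1009.
conn.executemany(
    "INSERT INTO threads (forumid, hasreplies, dateline) VALUES (1, 1, ?)",
    [(1000 + i,) for i in range(10)],
)

PAGE_SIZE = 3  # stands in for the 100 used in the question


def page_by_offset(offset):
    # "LIMIT Z, 100" style: the rows before the offset still get walked.
    return conn.execute(
        "SELECT threadid, dateline FROM threads"
        " WHERE forumid = 1 AND hasreplies = 1"
        " ORDER BY dateline LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


def page_after(last_dateline):
    # "dateline > Y LIMIT 100" style: start right after the last row seen.
    return conn.execute(
        "SELECT threadid, dateline FROM threads"
        " WHERE forumid = 1 AND hasreplies = 1 AND dateline > ?"
        " ORDER BY dateline LIMIT ?",
        (last_dateline, PAGE_SIZE),
    ).fetchall()


first = page_after(-1)              # first page
second = page_after(first[-1][1])   # next page, keyed on the last dateline seen
print(second)
```

Both styles return the same rows for the second page; the difference is that page_after needs the last dateline of the previous page, which is why you can only step page by page.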
