I am trying to optimize this query that sorts posts
by reputation
field (1st) and then id
field (2nd). Without 1st field query takes ~0.25
select *
from (
SELECT -- everything is ok here
, CASE
WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY)
THEN + p.reputation ELSE NULL END order_col
FROM posts AS p
WHERE p.status = 'published' -- the only thing for filter
LIMIT 0,10 -- limit provided as well
) a
ORDER BY
a.order_col desc
,a.id DESC
This is a very interesting query. During its optimisation you may discover and understand a lot of new information about how MySQL works. I am not sure that I will have time to write everything in details at once, but I can gradually update.
There are basically two scenarios: a quick and a slow.
In a quick scenario you are walking in some predefined order over a table and probably at the same time quickly fetch some data by id for each row from other tables. It this case you stop walking as soon as you have enough rows specified by your LIMIT clause. Where does the order come from? From a b-tree index that you have on the table or the order of a result set in a subquery.
In a slow scenario you do not have that predefined order, and MySQL has to implicitly put all data into a temporary table, sort the table on some field and return the n rows from your LIMIT clause. If any of the fields that you put into that temporary table is of type TEXT (not VARCHAR), MySQL does not even try to keep that table in RAM and flushes and sorts it on disk (hence additional IO processing).
There are many situations when you can not build an index that will allow you to follow its order (when you ORDER BY columns from different tables, for example), so the rule of thumb in such situations is to minimise the data that MySQL will put in the temporary table. How can you do it? You select only identifiers of the rows in a subquery and after you have the ids, you join the ids to the table itself and other tables to fetch the content. That is you make a small table with an order and then use the quick scenario. (This slightly contradicts to SQL in general, but each flavor of SQL has its own means to optimise queries that way).
Coincidentally, your SELECT -- everything is ok here
looks funny, since it is the first place where it is not ok.
SELECT p.*
, u.name user_name, u.status user_status
, c.name city_name, t.name town_name, d.name dist_name
, pm.meta_name, pm.meta_email, pm.meta_phone
, (SELECT concat("{",
'"id":"', pc.id, '",',
'"content":"', replace(pc.content, '"', '\\"'), '",',
'"date":"', pc.date, '",',
'"user_id":"', pcu.id, '",',
'"user_name":"', pcu.name, '"}"') last_comment_json
FROM post_comments pc
LEFT JOIN users pcu ON (pcu.id = pc.user_id)
WHERE pc.post_id = p.id
ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM (
SELECT id
FROM posts p
WHERE p.status = 'published'
ORDER BY
(CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY)
THEN +p.reputation ELSE NULL END) DESC,
p.id DESC
LIMIT 0,10
) ids
JOIN posts p ON ids.id = p.id -- mind the join for the p data
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
;
That is the first step, but even now you can see that you do not need to make these useless LEFT JOINS and json serialisations for the rows you do not need. (I skipped GROUP BY p.id
, because I do not see which LEFT JOIN might result in several rows, you do not do any aggregation).
yet to write about:
Inflate-deflate -- LEFT JOIN
inflates the number of rows, GROUP BY
then deflates. The inflated number of rows is costly. Instead, focus on getting the ids for the desired rows before doing any JOINing
. With luck, you can get rid of the GROUP BY
.
WP schema -- This is an EAV schema which sucks when it comes to performance and scaling.
What indexes do you have? See this for how to improve the meta table.
Complex ORDER BY
. This leads to gathering all the rows (after filtering) before sorting and doing the LIMIT
. Rethink the ORDER BY
clause, if possible.
After you have done what you can with my suggestions, start another Question to continue the refinement. Be sure to include EXPLAIN SELECT ...
and SHOW CREATE TABLE
.
Perhaps an index with columns: id
, reputation
, created_at
could help to speed up a bit, That would be the easiest solution, if you did not try that yet. The DBMS would not have to read so much data, to calculate the offset, limit - affected records.
Here's your problem:
There are ways out of this mess, but you will need to tell how many different levels of "reputation" you have (like 3, or like "a lot") and how they are statistically distributed (like, 1 user with reputation 100 and the rest all have zero, or evenly distributed).
EDIT
Hmm, no information on the statistical distribution of "reputation" or its possible range of values. In this case, let's go with the blunt approach:
Let's add a column "repdate" which contains:
repdate = p.created_at + INTERVAL p.reputation DAY
This corresponds to shifting posts one day into the future for each reputation point they have. They will then sort accordingly. Adjust to taste if p.created_at is not a DATETIME.
Now, we can simply "ORDER BY repdate DESC" and with an index on it, it will be fast.