Optimizing ORDER BY

前端 未结 5 1623
野趣味
野趣味 2021-01-21 08:21

I am trying to optimize this query that sorts posts by reputation field (1st) and then id field (2nd). Without 1st field query takes ~0.25

相关标签:
5条回答
  • 2021-01-21 08:33
    select * 
    from (
      SELECT -- everything is ok here
      , CASE 
          WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
            THEN + p.reputation ELSE NULL END order_col
      FROM posts AS p
      WHERE p.status = 'published' -- the only thing for filter
      LIMIT 0,10 -- limit provided as well
    ) a
    ORDER BY 
        a.order_col desc
        ,a.id DESC
    
    0 讨论(0)
  • 2021-01-21 08:44

    This is a very interesting query. During its optimisation you may discover and understand a lot of new information about how MySQL works. I am not sure that I will have time to write everything in details at once, but I can gradually update.

    Why it is slow

    There are basically two scenarios: a quick and a slow.

    In a quick scenario you are walking in some predefined order over a table and probably at the same time quickly fetch some data by id for each row from other tables. It this case you stop walking as soon as you have enough rows specified by your LIMIT clause. Where does the order come from? From a b-tree index that you have on the table or the order of a result set in a subquery.

    In a slow scenario you do not have that predefined order, and MySQL has to implicitly put all data into a temporary table, sort the table on some field and return the n rows from your LIMIT clause. If any of the fields that you put into that temporary table is of type TEXT (not VARCHAR), MySQL does not even try to keep that table in RAM and flushes and sorts it on disk (hence additional IO processing).

    First thing to fix

    There are many situations when you can not build an index that will allow you to follow its order (when you ORDER BY columns from different tables, for example), so the rule of thumb in such situations is to minimise the data that MySQL will put in the temporary table. How can you do it? You select only identifiers of the rows in a subquery and after you have the ids, you join the ids to the table itself and other tables to fetch the content. That is you make a small table with an order and then use the quick scenario. (This slightly contradicts to SQL in general, but each flavor of SQL has its own means to optimise queries that way).

    Coincidentally, your SELECT -- everything is ok here looks funny, since it is the first place where it is not ok.

    SELECT p.*
        , u.name user_name, u.status user_status
        , c.name city_name, t.name town_name, d.name dist_name
        , pm.meta_name, pm.meta_email, pm.meta_phone
        , (SELECT concat("{", 
            '"id":"', pc.id, '",', 
            '"content":"', replace(pc.content, '"', '\\"'), '",', 
            '"date":"', pc.date, '",', 
            '"user_id":"', pcu.id, '",', 
            '"user_name":"', pcu.name, '"}"') last_comment_json 
            FROM post_comments pc 
            LEFT JOIN users pcu ON (pcu.id = pc.user_id) 
            WHERE pc.post_id = p.id
            ORDER BY pc.id DESC LIMIT 1) AS last_comment
    FROM (
        SELECT id
        FROM posts p
        WHERE p.status = 'published'
        ORDER BY 
            (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
                THEN +p.reputation ELSE NULL END) DESC, 
            p.id DESC
        LIMIT 0,10
    ) ids
    JOIN posts p ON ids.id = p.id  -- mind the join for the p data
    LEFT JOIN users u ON (u.id = p.user_id)
    LEFT JOIN citys c ON (c.id = p.city_id)
    LEFT JOIN towns t ON (t.id = p.town_id)
    LEFT JOIN dists d ON (d.id = p.dist_id)
    LEFT JOIN post_metas pm ON (pm.post_id = p.id)
    ;
    

    That is the first step, but even now you can see that you do not need to make these useless LEFT JOINS and json serialisations for the rows you do not need. (I skipped GROUP BY p.id, because I do not see which LEFT JOIN might result in several rows, you do not do any aggregation).

    yet to write about:

    • indexes
    • reformulate CASE clause (use UNION ALL)
    • probably forcing an index
    0 讨论(0)
  • 2021-01-21 08:47
    • Inflate-deflate -- LEFT JOIN inflates the number of rows, GROUP BY then deflates. The inflated number of rows is costly. Instead, focus on getting the ids for the desired rows before doing any JOINing. With luck, you can get rid of the GROUP BY.

    • WP schema -- This is an EAV schema which sucks when it comes to performance and scaling.

    • What indexes do you have? See this for how to improve the meta table.

    • Complex ORDER BY. This leads to gathering all the rows (after filtering) before sorting and doing the LIMIT. Rethink the ORDER BY clause, if possible.

    After you have done what you can with my suggestions, start another Question to continue the refinement. Be sure to include EXPLAIN SELECT ... and SHOW CREATE TABLE.

    0 讨论(0)
  • 2021-01-21 08:55

    Perhaps an index with columns: id, reputation, created_at could help to speed up a bit, That would be the easiest solution, if you did not try that yet. The DBMS would not have to read so much data, to calculate the offset, limit - affected records.

    0 讨论(0)
  • 2021-01-21 08:58

    Here's your problem:

    • "ORDER BY expression": the expression has to be computed for each row in the table, then the sort done on the whole table, then the results go through the LIMIT.
    • No index use: "ORDER BY col" when "col" is part of an index can eliminate the sort by going through the index in-order. This is very efficient when using LIMIT. However, it will not work here.

    There are ways out of this mess, but you will need to tell how many different levels of "reputation" you have (like 3, or like "a lot") and how they are statistically distributed (like, 1 user with reputation 100 and the rest all have zero, or evenly distributed).

    EDIT

    Hmm, no information on the statistical distribution of "reputation" or its possible range of values. In this case, let's go with the blunt approach:

    Let's add a column "repdate" which contains:

    repdate = p.created_at + INTERVAL p.reputation DAY
    

    This corresponds to shifting posts one day into the future for each reputation point they have. They will then sort accordingly. Adjust to taste if p.created_at is not a DATETIME.

    Now, we can simply "ORDER BY repdate DESC" and with an index on it, it will be fast.

    0 讨论(0)
提交回复
热议问题