Optimizing multiple joins

后端 未结 3 533
悲&欢浪女
悲&欢浪女 2021-02-04 07:14

I\'m trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along w

相关标签:
3条回答
  • 2021-02-04 07:26

    There are always 2 things to consider when optimising queries:

    • What indexes can be used (you may need to create indexes)
    • How the query is written (you may need to change the query to allow the query optimser to be able to find appropriate indexes, and to not re-read data redundantly)

    A few observations:

    • You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimser from using an index even if it exists. You should try to write your expressions in such a way that indexed columns exist unaltered on one side of the expression.

    • Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimser's ability to choose the most efficient optimisation. I suspect that may have been written in to improve performance because the optimser was unable to use an index on the date column (body_time)?

    • NOTE: We would actually very much like to use an index on Body.body_time

    • ORDER BY within the subqueries is at best redundant. At worst it could force the query optimiser to sort the result set before joining; and that is not necessarily good for the query plan. Rather only apply ordering right at the end for final display.

    • Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command=NULL, and these would consequently be excluded by the condition envelope_command=?.

    • Subqueries o and i are almost identical save for the envelope_command value. This forces the optimser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once, and split the values into 2 columns.

    Try the following which uses the pivot technique:

    SELECT  p.period,
            /*The pivot technique in action...*/
            SUM(
            CASE WHEN envelope_command = 1 THEN body_size
            ELSE 0
            END) AS Outbound,
            SUM(
            CASE WHEN envelope_command = 2 THEN body_size
            ELSE 0
            END) AS Inbound
    FROM    (
            SELECT  date '2009-10-01' + s.day AS period
            FROM    generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
            ) AS p 
            /*The left JOIN is justified to ensure ALL generated dates are returned
              Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
            LEFT OUTER JOIN (
            SELECT  b.body_size,
                    b.body_time,
                    e.envelope_command
            FROM    body AS b 
                    INNER JOIN envelope e 
                      ON e.message_id = b.message_id 
            WHERE   envelope_command IN (1, 2)
            ) d
              /*The expressions below allow the optimser to use an index on body_time if 
                the statistics indicate it would be beneficial*/
              ON d.body_time >= p.period
             AND d.body_time < p.period + INTERVAL '1 DAY'
    GROUP BY p.Period
    ORDER BY p.Period
    

    EDIT: Added filter suggested by Tom H.

    0 讨论(0)
  • 2021-02-04 07:34

    I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.

    The keys are:

    1. You shouldn't need the subqueries - just do the direct joins and aggregate
    2. You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs

    If nothing else, I think that the query below is a bit clearer.

    I used a calendar table in my query, but you can replace that with the generate_series as you were using it.

    Also, depending on indexing, it might be better to compare the body_date with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_date >= date (time=midnight) AND body_date < date + 1 (time=midnight).

    SELECT
        CAL.calendar_date AS period,
        SUM(O.body_size) AS outbound,
        SUM(I.body_size) AS inbound
    FROM
        Calendar CAL
    INNER JOIN Body OB ON
        OB.body_time::date = CAL.calendar_date
    INNER JOIN Envelope OE ON
        OE.message_id = OB.message_id AND
        OE.envelope_command = 1
    INNER JOIN Body IB ON
        IB.body_time::date = CAL.calendar_date
    INNER JOIN Envelope IE ON
        IE.message_id = IB.message_id AND
        IE.envelope_command = 2
    GROUP BY
        CAL.calendar_date
    
    0 讨论(0)
  • 2021-02-04 07:46

    Building on Craig Young's suggestions, here is the amended query which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s and a huge improvement on Craig's which took ~22s.

    SELECT
        p.period,
        /* The pivot technique... */
        SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
        SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
    FROM
    (
        /* Get days range */
        SELECT date '2009-10-01' + day AS period
        FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
    ) p
        /* Join message information */
        LEFT OUTER JOIN
        (
            SELECT b.body_size, b.body_time::date, e.envelope_command
            FROM body AS b 
                INNER JOIN envelope e ON e.message_id = b.message_id 
            WHERE
                e.envelope_command IN (2, 1)
                AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
        ) d ON d.body_time = p.period
    GROUP BY p.period
    ORDER BY p.period
    
    0 讨论(0)
提交回复
热议问题