General rules for simplifying SQL statements

前端 未结 8 1503
心在旅途
心在旅途 2020-12-22 17:15

I\'m looking for some \"inference rules\" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size. Does there exist some

相关标签:
8条回答
  • 2020-12-22 17:34

    My approach is to learn relational theory in general and relational algebra in particular. Then learn to spot the constructs used in SQL to implement operators from the relational algebra (e.g. universal quantification a.k.a. division) and calculus (e.g. existential quantification). The gotcha is that SQL has features not found in the relational model e.g. nulls, which are probably best refactored away anyhow. Recommended reading: SQL and Relational Theory: How to Write Accurate SQL Code By C. J. Date.

    In this vein, I'm not convinced "the fact that most SUBSELECTs can be rewritten as a JOIN" represents a simplification.

    Take this query for example:

    SELECT c 
      FROM T1 
     WHERE c NOT IN ( SELECT c FROM T2 );
    

    Rewrite using JOIN

    SELECT DISTINCT T1.c 
      FROM T1 NATURAL LEFT OUTER JOIN T2 
     WHERE T2.c IS NULL;
    

    The join is more verbose!

    Alternatively, recognize the construct is implementing an antijoin on the projection of c e.g. pseudo algrbra

    T1 { c } antijoin T2 { c }
    

    Simplification using relational operators:

    SELECT c FROM T1 EXCEPT SELECT c FROM T2;
    
    0 讨论(0)
  • 2020-12-22 17:36

    Here's a few from working with Oracle 8 & 9 (of course, sometimes doing the opposite might make the query simpler or faster):

    Parentheses can be removed if they are not used to override operator precedence. A simple example is when all the boolean operators in your where clause are the same: where ((a or b) or c) is equivalent to where a or b or c.

    A sub-query can often (if not always) be merged with the main query to simplify it. In my experience, this often improves performance considerably:

    select foo.a,
           bar.a
      from foomatic  foo,
           bartastic bar
     where foo.id = bar.id and
           bar.id = (
             select ban.id
               from bantabulous ban
              where ban.bandana = 42
           )
    ;
    

    is equivalent to

    select foo.a,
           bar.a
      from foomatic    foo,
           bartastic   bar,
           bantabulous ban
     where foo.id = bar.id and
           bar.id = ban.id and
           ban.bandana = 42
    ;
    

    Using ANSI joins separates a lot of "code monkey" logic from the really interesting parts of the where clause: The previous query is equivalent to

    select foo.a,
           bar.a
      from foomatic    foo
      join bartastic   bar on bar.id = foo.id
      join bantabulous ban on ban.id = bar.id
     where ban.bandana = 42
    ;
    

    If you want to check for the existence of a row, don't use count(*), instead use either rownum = 1 or put the query in a where exists clause to fetch only one row instead of all.

    0 讨论(0)
  • 2020-12-22 17:40
    • I suppose the obvious one is look for any Cursors that can be replaced with a SQL 'Set' based operation.
    • Next on my list, is look for any correlated sub-queries that can be re-written as a un-correlated query
    • In long stored procedures, break out separate SQL statements into their own stored procedures. That way they will get there own cached query plan.
    • Look for transactions that can have their scope shortened. I regularly find statements inside a transaction that can safely be outside.
    • Sub-selects can often be re-written as straight forward joins (modern optimisers are good at spotting simple ones)

    As @Quassnoi mentioned, the Optimiser often does a good job. One way to help it is to ensure indexes and statistics are up to date, and that suitable indexes exist for your query workload.

    0 讨论(0)
  • 2020-12-22 17:44

    Given the nature of SQL, you absolutely have to be aware of the performance implications of any refactoring. Refactoring SQL Applications is a good resource on refactoring with a heavy emphasis on performance (see Chapter 5).

    0 讨论(0)
  • 2020-12-22 17:45

    To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?

    That's exactly what optimizers do for a living (not that I'm saying they always do this well).

    Since SQL is a set based language, there are usually more than one way to transform one query to other.

    Like this query:

    SELECT  *
    FROM    mytable
    WHERE   col1 > @value1 OR col2 < @value2
    

    can be transformed into this:

    SELECT  *
    FROM    mytable
    WHERE   col1 > @value1
    UNION
    SELECT  *
    FROM    mytable
    WHERE   col2 < @value2
    

    or this:

    SELECT  mo.*
    FROM    (
            SELECT  id
            FROM    mytable
            WHERE   col1 > @value1
            UNION
            SELECT  id
            FROM    mytable
            WHERE   col2 < @value2
            ) mi
    JOIN    mytable mo
    ON      mo.id = mi.id
    

    , which look uglier but can yield better execution plans.

    One of the most common things to do is replacing this query:

    SELECT  *
    FROM    mytable
    WHERE   col IN
            (
            SELECT  othercol
            FROM    othertable
            )
    

    with this one:

    SELECT  *
    FROM    mytable mo
    WHERE   EXISTS
            (
            SELECT  NULL
            FROM    othertable o
            WHERE   o.othercol = mo.col
            )
    

    In some RDBMS's (like PostgreSQL), DISTINCT and GROUP BY use the different execution plans, so sometimes it's better to replace one with the other:

    SELECT  mo.grouper,
            (
            SELECT  SUM(col)
            FROM    mytable mi
            WHERE   mi.grouper = mo.grouper
            )
    FROM    (
            SELECT  DISTINCT grouper
            FROM    mytable
            ) mo
    

    vs.

    SELECT  mo.grouper, SUM(col)
    FROM    mytable
    GROUP BY
            mo.grouper
    

    In PostgreSQL, DISTINCT sorts and GROUP BY hashes.

    MySQL lacks FULL OUTER JOIN, so it can be rewritten as folloing:

    SELECT  t1.col1, t2.col2
    FROM    table1 t1
    LEFT OUTER JOIN
            table2 t2
    ON      t1.id = t2.id
    

    vs.

    SELECT  t1.col1, t2.col2
    FROM    table1 t1
    LEFT JOIN
            table2 t2
    ON      t1.id = t2.id
    UNION ALL
    SELECT  NULL, t2.col2
    FROM    table1 t1
    RIGHT JOIN
            table2 t2
    ON      t1.id = t2.id
    WHERE   t1.id IS NULL
    

    , but see this article in my blog on how to do this more efficiently in MySQL:

    • Emulating FULL OUTER JOIN in MySQL

    This hierarchical query in Oracle:

    SELECT  DISTINCT(animal_id) AS animal_id
    FROM    animal
    START WITH
            animal_id = :id
    CONNECT BY
            PRIOR animal_id IN (father, mother)
    ORDER BY
            animal_id
    

    can be transformed to this:

    SELECT  DISTINCT(animal_id) AS animal_id
    FROM    (
            SELECT  0 AS gender, animal_id, father AS parent
            FROM    animal
            UNION ALL
            SELECT  1, animal_id, mother
            FROM    animal
            )
    START WITH
            animal_id = :id
    CONNECT BY
            parent = PRIOR animal_id
    ORDER BY
            animal_id
    

    , the latter one being more performant.

    See this article in my blog for the execution plan details:

    • Genealogy query on both parents

    To find all ranges that overlap the given range, you can use the following query:

    SELECT  *
    FROM    ranges
    WHERE   end_date >= @start
            AND start_date <= @end
    

    , but in SQL Server this more complex query yields same results faster:

    SELECT  *
    FROM    ranges
    WHERE   (start_date > @start AND start_date <= @end)
            OR (@start BETWEEN start_date AND end_date)
    

    , and believe it or not, I have an article in my blog on this too:

    • Overlapping ranges: SQL Server

    SQL Server also lacks an efficient way to do cumulative aggregates, so this query:

    SELECT  mi.id, SUM(mo.value) AS running_sum
    FROM    mytable mi
    JOIN    mytable mo
    ON      mo.id <= mi.id
    GROUP BY
            mi.id
    

    can be more efficiently rewritten using, Lord help me, cursors (you heard me right: cursors, more efficiently and SQL Server in one sentence).

    See this article in my blog on how to do it:

    • Flattening timespans: SQL Server

    There is a certain kind of query commonly met in financial applications that searches for the effective rate for a currency, like this one in Oracle:

    SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
    FROM    t_transaction x
    JOIN    t_rate r
    ON      (rte_currency, rte_date) IN
            (
            SELECT  xac_currency, MAX(rte_date)
            FROM    t_rate
            WHERE   rte_currency = xac_currency
                    AND rte_date <= xac_date
            )
    

    This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:

    WITH v_rate AS
            (
            SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
            FROM    (
                    SELECT  cur_id, dte_date,
                            (
                            SELECT  MAX(rte_date)
                            FROM    t_rate ri
                            WHERE   rte_currency = cur_id
                                    AND rte_date <= dte_date
                            ) AS rte_effdate
                    FROM    (
                            SELECT  (
                                    SELECT  MAX(rte_date)
                                    FROM    t_rate
                                    ) - level + 1 AS dte_date
                            FROM    dual
                            CONNECT BY
                                    level <=
                                    (
                                    SELECT  MAX(rte_date) - MIN(rte_date)
                                    FROM    t_rate
                                    )
                            ) v_date,
                            (
                            SELECT  1 AS cur_id
                            FROM    dual
                            UNION ALL
                            SELECT  2 AS cur_id
                            FROM    dual
                            ) v_currency
                    ) v_eff
            LEFT JOIN
                    t_rate
            ON      rte_currency = cur_id
                    AND rte_date = rte_effdate
            )
    SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
    FROM    (
            SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
            FROM    t_transaction x
            GROUP BY
                    xac_currency, TRUNC(xac_date)
            )
    JOIN    v_rate
    ON      eff_currency = xac_currency
            AND eff_date = xac_date
    

    Despite being bulky as a hell, the latter query is 6 times faster.

    The main idea here is replacing <= with =, which requires building an in-memory calendar table. to JOIN with.

    • Converting currencies
    0 讨论(0)
  • 2020-12-22 17:46

    I like to replace all sort of subselect by join query.

    This one is obvious :

    SELECT  *
    FROM    mytable mo
    WHERE   EXISTS
            (
              SELECT  *
              FROM    othertable o
              WHERE   o.othercol = mo.col
            )
    

    by

    SELECT  mo.*
    FROM    mytable mo inner join othertable o on o.othercol = mo.col
    

    And this one is under estimate :

    SELECT  *
    FROM    mytable mo
    WHERE   NOT EXISTS
            (
              SELECT  *
              FROM    othertable o
              WHERE   o.othercol = mo.col
            )
    

    by

    SELECT  mo.*
    FROM    mytable mo left outer join othertable o on o.othercol = mo.col
    WHERE   o.othercol is null
    

    It could help the DBMS to choose the good execution plan in a big request.

    0 讨论(0)
提交回复
热议问题