SQL query performance question (multiple sub-queries)

限于喜欢 提交于 2019-12-13 03:48:01

问题


I have this query:

SELECT p.id, r.status, r.title
FROM page AS p
    INNER JOIN page_revision as r ON r.pageId = p.id AND (
        r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
        OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
    )

Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.

Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.

Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this cant be done without them.

Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and If no pages exist I still want the site returned.

Jack

Edit: I'm using MySQL


回答1:


If "active" is the first in alphabetical order you migt be able to reduce subqueries to:

SELECT p.id, r.status, r.title
FROM page AS p
    INNER JOIN page_revision as r ON r.pageId = p.id AND 
        r.id = (SELECT r2.id 
                FROM page_revision as r2 
                WHERE r2.pageId = r.pageId 
                ORDER BY r2.status, r2.id DESC
                LIMIT 1)

Otherwise you can replace ORDER BY line with

ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC

These all come from my assumptions on SQL Server, your mileage with MySQL may vary.




回答2:


Maybe a little re-factoring is in order?

If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.

I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.




回答3:


Your problem is a particular case of what is described in this question.

The best you can get using standard ANSI SQL seems to be:

SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id 
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)

Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.




回答4:


In MS SQL 2005+ and Oracle:

SELECT p.id, r.status, r.title
FROM (
  SELECT p.*, r,*,
         ROW_NUMBER() OVER (PARTITION BY p.pageId ORDER BY CASE WHEN p.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
  FROM page AS p, page_revision r
  WHERE r.id = p.pageId
  ) o
WHERE rn = 1

In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.

You'll need to create two indexes and a function that returns the last page revision to use those indexes:

CREATE INDEX ix_revision_page_status_id ON page_revision (page_id, id, status);

CREATE INDEX ix_revision_page_id (page_id, id);

CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
  DECLARE id INT;
  SELECT r_id
  INTO id
  FROM (
    SELECT r.id
    FROM page_revisions
    FORCE INDEX (ix_revision_page_status_id)
    WHERE page_id = input_id
      AND status = 'active'
    ORDER BY id DESC 
    LIMIT 1
    UNION ALL
    SELECT r.id
    FROM page_revisions
    FORCE INDEX (ix_revision_page_id)
    WHERE page_id = input_id
    ORDER BY id DESC 
    LIMIT 1
  ) o
  LIMIT 1;
  RETURN id;
END;

SELECT po.id, r.status, r.title
FROM (
  SELECT p.*, fn_get_last_revision(p.page_id) AS rev_id
  FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;

This will efficiently use index to get the last revision of the page.

P. S. If you will use codes for statuses and use 0 for active, you can get rid of the second index and simplify the query.



来源:https://stackoverflow.com/questions/560777/sql-query-performance-question-multiple-sub-queries

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!