问题
I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this cant be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and If no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL
回答1:
If "active" is the first in alphabetical order you migt be able to reduce subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
回答2:
Maybe a little re-factoring is in order?
If you added a latest_revision_id
column onto pages
your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
回答3:
Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.
回答4:
In MS SQL 2005+
and Oracle
:
SELECT p.id, r.status, r.title
FROM (
SELECT p.*, r,*,
ROW_NUMBER() OVER (PARTITION BY p.pageId ORDER BY CASE WHEN p.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page AS p, page_revision r
WHERE r.id = p.pageId
) o
WHERE rn = 1
In MySQL
that can become a problem, as subqueries cannot use the INDEX RANGE SCAN
as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (page_id, id, status);
CREATE INDEX ix_revision_page_id (page_id, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
DECLARE id INT;
SELECT r_id
INTO id
FROM (
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_status_id)
WHERE page_id = input_id
AND status = 'active'
ORDER BY id DESC
LIMIT 1
UNION ALL
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_id)
WHERE page_id = input_id
ORDER BY id DESC
LIMIT 1
) o
LIMIT 1;
RETURN id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.page_id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will efficiently use index to get the last revision of the page.
P. S. If you will use codes for statuses and use 0
for active, you can get rid of the second index and simplify the query.
来源:https://stackoverflow.com/questions/560777/sql-query-performance-question-multiple-sub-queries