I have the following query:
SELECT table_1.id
FROM
table_1
LEFT JOIN table_2 ON (table_1.id = table_2.id)
WHERE
table_1.col_condition_1 = 0
AND table_1.col
Problems like this tend to require trying things and testing to see how well they work.
As such, start with this:
SELECT
table_1.id
FROM
table_1
LEFT JOIN table_2
ON table_1.id = table_2.id
AND table_1.date_col <= table_2.date_col
WHERE
table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)
AND table_2.id is NULL
LIMIT 5000;
Logical reasoning on why this is equivalent to your query:
Your original query's WHERE condition (table_2.id IS NULL OR table_1.date_col > table_2.date_col)
can be summarized as: "Only include table_1 records that either do NOT have a table_2 record, or where the table_2 record is strictly earlier than the table_1 record."
My version of the query uses an anti-join to exclude all table_1 records for which there exists a table_2 record that is later than (or on the same date as) the table_1 record.
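One way to sanity-check that reasoning is to run both forms against a handful of rows that cover each case. The following is only a sketch: the column types and sample rows are made up, the OR form is rebuilt from the WHERE condition quoted above, and it assumes at most one table_2 row per id and no NULL dates (with duplicates or NULL dates the two forms can differ). Run it in an empty scratch schema so it does not collide with the real tables.

```sql
-- Hypothetical minimal schema; the real tables certainly have more columns.
CREATE TABLE table_1 (
  id INT PRIMARY KEY,
  col_condition_1 INT,
  col_condition_2 INT,
  date_col DATE
);
CREATE TABLE table_2 (
  id INT PRIMARY KEY,
  date_col DATE
);

INSERT INTO table_1 VALUES
  (1, 0, 1, '2020-01-10'),  -- no table_2 row
  (2, 0, 1, '2020-01-10'),  -- table_2 row is earlier
  (3, 0, 1, '2020-01-10'),  -- table_2 row has the same date
  (4, 0, 1, '2020-01-10'),  -- table_2 row is later
  (5, 0, 3, '2020-01-10'),  -- removed by col_condition_2 NOT IN (3, 4)
  (6, 1, 1, '2020-01-10');  -- removed by col_condition_1 = 0
INSERT INTO table_2 VALUES
  (2, '2020-01-05'),
  (3, '2020-01-10'),
  (4, '2020-01-15');

-- OR form (assumed shape of the original query): returns ids 1 and 2.
SELECT table_1.id
FROM table_1
LEFT JOIN table_2 ON table_1.id = table_2.id
WHERE table_1.col_condition_1 = 0
  AND table_1.col_condition_2 NOT IN (3, 4)
  AND (table_2.id IS NULL OR table_1.date_col > table_2.date_col)
ORDER BY table_1.id;

-- Anti-join form: also returns ids 1 and 2.
SELECT table_1.id
FROM table_1
LEFT JOIN table_2
  ON table_1.id = table_2.id
  AND table_1.date_col <= table_2.date_col
WHERE table_1.col_condition_1 = 0
  AND table_1.col_condition_2 NOT IN (3, 4)
  AND table_2.id IS NULL
ORDER BY table_1.id;
```

Rows 3 and 4 are the interesting ones: the OR form drops them because the date comparison fails, and the anti-join drops them because the join finds a match.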
There are a number of possible composite indexes that may help this query. Here are a couple to start with:
For table_2: (id,date_col)
For table_1: (col_condition_1,id,date_col,col_condition_2)
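As a sketch, those two composite indexes could be created like this. The index names are placeholders I made up; use whatever naming convention you already have.

```sql
-- idx_t2_id_date and idx_t1_cond are hypothetical names.
-- Covers the join lookup on id plus the date comparison.
CREATE INDEX idx_t2_id_date
  ON table_2 (id, date_col);

-- Leads with the equality filter, then carries the other columns the query reads.
CREATE INDEX idx_t1_cond
  ON table_1 (col_condition_1, id, date_col, col_condition_2);
```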
Please try my query and indexes, and report the results (including EXPLAIN plan).
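To get the plan, prefix the query with EXPLAIN. Recent MySQL 8.0 releases also accept EXPLAIN ANALYZE, which actually runs the query and reports measured timings, but plain EXPLAIN is enough to see which indexes are chosen.

```sql
-- Shows the chosen access path per table without executing the query.
EXPLAIN
SELECT table_1.id
FROM table_1
LEFT JOIN table_2
  ON table_1.id = table_2.id
  AND table_1.date_col <= table_2.date_col
WHERE table_1.col_condition_1 = 0
  AND table_1.col_condition_2 NOT IN (3, 4)
  AND table_2.id IS NULL
LIMIT 5000;
```

The key, rows, and Extra columns of the output are the most useful parts to post.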
Try splitting the existing SQL into two parts and timing each one. That should show which part is responsible for the slowness:
part 1:
SELECT table_1.id
FROM table_1
LEFT JOIN table_2
ON (table_1.id = table_2.id)
WHERE table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)
AND table_2.id is NULL
and part 2 (note the inner join here):
SELECT table_1.id
FROM table_1
JOIN table_2
ON (table_1.id = table_2.id)
WHERE table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)
AND table_1.date_col > table_2.date_col
I expect part 2 to be the one that takes longer. For that part, I think an index on date_col in both table_1 and table_2 would help.
I don't think the composite index would help at all in your select.
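The single-column date_col indexes mentioned above might look like this. Again, the index names are hypothetical, and whether the optimizer actually uses them for part 2 is something only EXPLAIN on your data can confirm.

```sql
-- idx_t1_date and idx_t2_date are hypothetical names.
CREATE INDEX idx_t1_date ON table_1 (date_col);
CREATE INDEX idx_t2_date ON table_2 (date_col);
```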
That said, it is hard to diagnose why the three conditions together hurt performance that badly. It seems to be related to your data distribution. I'm not sure about MySQL, but in Oracle, collecting statistics on those tables would make a difference.
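For what it's worth, the rough MySQL counterpart of collecting statistics is ANALYZE TABLE. Whether it changes the plan here is untested; it is just a cheap thing to try before comparing EXPLAIN output again.

```sql
-- Refreshes the key-distribution statistics the optimizer relies on.
ANALYZE TABLE table_1, table_2;
```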
Hope it helps.
OR is a performance killer.
UNION instead of OR can speed up the query.
LIMIT without ORDER BY is dubious.
id_UNIQUE.
INDEX(a) is unnecessary when you also have INDEX(a,b).
IN (1, 2) might be faster than NOT IN (3, 4).
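A sketch of the UNION rewrite, assuming the OR in question is (table_2.id IS NULL OR table_1.date_col > table_2.date_col) as quoted earlier in the thread. Each branch handles one side of the OR. UNION ALL is used because a table_1 row either has a matching table_2 row or it does not, so the branches cannot overlap and no de-duplication pass is needed. The ORDER BY id is my assumption about which 5000 rows are wanted; pick whatever ordering makes the LIMIT meaningful.

```sql
(
  -- Branch 1: table_1 rows with no table_2 row at all.
  SELECT table_1.id
  FROM table_1
  LEFT JOIN table_2 ON table_1.id = table_2.id
  WHERE table_1.col_condition_1 = 0
    AND table_1.col_condition_2 NOT IN (3, 4)
    AND table_2.id IS NULL
)
UNION ALL
(
  -- Branch 2: table_1 rows whose table_2 row is strictly earlier.
  SELECT table_1.id
  FROM table_1
  JOIN table_2 ON table_1.id = table_2.id
  WHERE table_1.col_condition_1 = 0
    AND table_1.col_condition_2 NOT IN (3, 4)
    AND table_1.date_col > table_2.date_col
)
-- Applies to the combined result, making the LIMIT deterministic.
ORDER BY id
LIMIT 5000;
```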