EDIT: I removed the GROUP BY clause from the example queries, but the same problem shows: "When I join table x to an empty/1 row table y, MySQL makes a full table scan".
After some tests it turns out that if the second table (user_school_mm) has some data, MySQL will not do a full table scan on the first table, and if the second table (country) has no data or very little data (1 or 2 records), MySQL will do a full table scan. Why does this happen? I don't know.
How to reproduce
1- Create a schema like this
CREATE TABLE `event` (
`ev_id` int(11) NOT NULL AUTO_INCREMENT,
`ev_note` varchar(255) DEFAULT NULL,
PRIMARY KEY (`ev_id`)
) ENGINE=InnoDB;
CREATE TABLE `table1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB ;
CREATE TABLE `table2` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB ;
2- insert some data into the main table (event in this case); I filled it with 35601000 rows
3- leave table1 empty and insert 15 rows in table2
insert into table2 (name) values
('fooBar'),('fooBar'),('fooBar'),('fooBar'),('fooBar'),
('fooBar'),('fooBar'),('fooBar'),('fooBar'),('fooBar'),
('fooBar'),('fooBar'),('fooBar'),('fooBar'),('fooBar');
4- now join the main table with table2 and retest the same query with table1
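Step 2 does not say how the rows were generated. One possible way, as a sketch only, is to seed a single row and keep doubling the table with INSERT ... SELECT. The note text 'note' is a made-up placeholder, and 25 doublings give 2^25 = 33,554,432 rows, which is in the same range as the row count quoted above but not identical to it.

```sql
-- Hypothetical seed row; the note text is only a placeholder.
insert into event (ev_note) values ('note');

-- Each run of this statement doubles the number of rows in event.
-- Starting from 1 row, 25 runs give 2^25 = 33,554,432 rows.
insert into event (ev_note) select ev_note from event;
```

Any other bulk-loading method should do just as well; what matters for the test is only that event is large while table1 stays empty.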
Query 1 (Fast)
select *
from
event left join
table2 on event.ev_id = table2.id
order by event.ev_id
limit 2;
-- executed in 300 milliseconds measured by the client
Explain
+---+-----------+--------+------+----------------+--------+---------+------------------+------+--------+
|id |select_type|table | type | possible_keys | key | key_len | ref | rows | Extra |
+---+-----------+--------+------+----------------+--------+---------+------------------+------+--------+
|1 |SIMPLE |event |index | |PRIMARY |4 | | 2 | |
|1 |SIMPLE |table2 |eq_ref|PRIMARY |PRIMARY |4 |tests.event.ev_id | 1 | |
+---+-----------+--------+------+----------------+--------+---------+------------------+------+--------+
Query 2 (Slow)
select *
from
event left join
table1 on event.ev_id = table1.id
order by event.ev_id
limit 2;
-- executed in 79 seconds measured by the client
Explain
+---+-----------+--------+------+----------------+--------+---------+-------+---------+---------------------------------------------------+
|id |select_type|table | type | possible_keys | key | key_len | ref | rows | Extra |
+---+-----------+--------+------+----------------+--------+---------+-------+---------+---------------------------------------------------+
|1 |SIMPLE |event |ALL | | | | |33506704 | Using temporary; Using filesort |
|1 |SIMPLE |table1 |ALL |PRIMARY | | | |1 | Using where; Using join buffer (Block Nested Loop)|
+---+-----------+--------+------+----------------+--------+---------+-------+---------+---------------------------------------------------+
MySQL version is 5.6.38
The MySQL optimizer will decide on join order/method first, and then check whether, for the chosen join order, it is possible to avoid sorting by using an index. For the slow query in this question, the optimizer has decided to use Block-Nested-Loop (BNL) join.
BNL is usually quicker than using an index when one of the tables is very small (and there is no LIMIT).
However, with BNL, rows will not necessarily come in the order given by the first table. Hence, the result of the join needs to be sorted before applying the LIMIT.
You can turn off BNL with: set optimizer_switch = 'block_nested_loop=off';
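A minimal sketch of how this could be tried against the slow query from the question, assuming the change should apply to the current session only. Whether the plan then reads event through its PRIMARY key and drops the filesort has to be confirmed from the EXPLAIN output on your own server.

```sql
-- Disable Block Nested Loop for the current session only.
set session optimizer_switch = 'block_nested_loop=off';

-- Re-check the plan of the slow query (Query 2 in the question).
explain
select *
from
event left join
table1 on event.ev_id = table1.id
order by event.ev_id
limit 2;

-- Put the flag back to its default value afterwards.
set session optimizer_switch = 'block_nested_loop=default';
```

Scoping the switch to the session keeps BNL available for other connections, where it may still be the better choice for joins without a LIMIT.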
The main reason is the misuse of GROUP BY. Let's take the first query. Even though it is "fast", it is still "wrong":
SELECT *
FROM users
LEFT JOIN user_school_mm on users.id = user_school_mm.user_id
GROUP BY users.id
ORDER BY users.id ASC
LIMIT 2
A user can go to two schools. The use of the many:many mapping table user_school_mm implies that this is a possibility. So, after doing the JOIN, you get 2 rows for a single user. But then you GROUP BY users.id to boil it down to a single row. But which of the two school_id values should you use?
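For illustration only, here is one shape a query could take so that the GROUP BY is well defined: every selected column is either the grouping column or an aggregate. The aliases school_count and school_ids are made up for this sketch, and it may not be the result the asker actually wants.

```sql
-- One row per user; the schools are aggregated instead of picked arbitrarily.
SELECT users.id,
       COUNT(user_school_mm.school_id) AS school_count,      -- 0 for users with no school
       GROUP_CONCAT(user_school_mm.school_id) AS school_ids  -- NULL for users with no school
FROM users
LEFT JOIN user_school_mm ON users.id = user_school_mm.user_id
GROUP BY users.id
ORDER BY users.id ASC
LIMIT 2;
```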
I am not going to try to address the performance issues until you present queries that make sense. At that point it will be easier to point out why one query performs better than another.