Assuming I have the tables student
, club
, and student_club
:
student {
id
name
}
club {
id
name
}
stude
I tested in a real life db, so the names differ from the catskin list. It's a backup copy, so nothing changed during all test runs (except minor changes to the catalogs).
SELECT a.*
FROM ef.adr a
JOIN (
SELECT adr_id
FROM ef.adratt
WHERE att_id IN (10,14)
GROUP BY adr_id
HAVING COUNT(*) > 1) t using (adr_id);
Merge Join (cost=630.10..1248.78 rows=627 width=295) (actual time=13.025..34.726 rows=67 loops=1)
Merge Cond: (a.adr_id = adratt.adr_id)
-> Index Scan using adr_pkey on adr a (cost=0.00..523.39 rows=5767 width=295) (actual time=0.023..11.308 rows=5356 loops=1)
-> Sort (cost=630.10..636.37 rows=627 width=4) (actual time=12.891..13.004 rows=67 loops=1)
Sort Key: adratt.adr_id
Sort Method: quicksort Memory: 28kB
-> HashAggregate (cost=450.87..488.49 rows=627 width=4) (actual time=12.386..12.710 rows=67 loops=1)
Filter: (count(*) > 1)
-> Bitmap Heap Scan on adratt (cost=97.66..394.81 rows=2803 width=4) (actual time=0.245..5.958 rows=2811 loops=1)
Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
-> Bitmap Index Scan on adratt_att_id_idx (cost=0.00..94.86 rows=2803 width=0) (actual time=0.217..0.217 rows=2811 loops=1)
Index Cond: (att_id = ANY ('{10,14}'::integer[]))
Total runtime: 34.928 ms
WITH two AS (
SELECT adr_id
FROM ef.adratt
WHERE att_id IN (10,14)
GROUP BY adr_id
HAVING COUNT(*) > 1
)
SELECT a.*
FROM ef.adr a
JOIN two using (adr_id);
Hash Join (cost=1161.52..1261.84 rows=627 width=295) (actual time=36.188..37.269 rows=67 loops=1)
Hash Cond: (two.adr_id = a.adr_id)
CTE two
-> HashAggregate (cost=450.87..488.49 rows=627 width=4) (actual time=13.059..13.447 rows=67 loops=1)
Filter: (count(*) > 1)
-> Bitmap Heap Scan on adratt (cost=97.66..394.81 rows=2803 width=4) (actual time=0.252..6.252 rows=2811 loops=1)
Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
-> Bitmap Index Scan on adratt_att_id_idx (cost=0.00..94.86 rows=2803 width=0) (actual time=0.226..0.226 rows=2811 loops=1)
Index Cond: (att_id = ANY ('{10,14}'::integer[]))
-> CTE Scan on two (cost=0.00..50.16 rows=627 width=4) (actual time=13.065..13.677 rows=67 loops=1)
-> Hash (cost=384.68..384.68 rows=5767 width=295) (actual time=23.097..23.097 rows=5767 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1153kB
-> Seq Scan on adr a (cost=0.00..384.68 rows=5767 width=295) (actual time=0.005..10.955 rows=5767 loops=1)
Total runtime: 37.482 ms
SELECT *
FROM student
WHERE id IN (SELECT student_id
FROM student_club
WHERE club_id = 30
INTERSECT
SELECT student_id
FROM student_club
WHERE club_id = 50)
Or a more general solution easier to extend to n
clubs and that avoids INTERSECT
(not available in MySQL) and IN
(as performance of this sucks in MySQL)
SELECT s.id,
s.name
FROM student s
join student_club sc
ON s.id = sc.student_id
WHERE sc.club_id IN ( 30, 50 )
GROUP BY s.id,
s.name
HAVING COUNT(DISTINCT sc.club_id) = 2
So there's more than one way to skin a cat.
I'll to add two more to make it, well, more complete.
Assuming a sane data model where (student_id, club_id)
is unique in student_club
.
Martin Smith's second version is like somewhat similar, but he joins first, groups later. This should be faster:
SELECT s.id, s.name
FROM student s
JOIN (
SELECT student_id
FROM student_club
WHERE club_id IN (30, 50)
GROUP BY 1
HAVING COUNT(*) > 1
) sc USING (student_id);
And of course, there is the classic EXISTS
. Similar to Derek's variant with IN
. Simple and fast. (In MySQL, this should be quite a bit faster than the variant with IN
):
SELECT s.id, s.name
FROM student s
WHERE EXISTS (SELECT 1 FROM student_club
WHERE student_id = s.student_id AND club_id = 30)
AND EXISTS (SELECT 1 FROM student_club
WHERE student_id = s.student_id AND club_id = 50);
@erwin-brandstetter Please, benchmark this:
SELECT s.stud_id, s.name
FROM student s, student_club x, student_club y
WHERE x.club_id = 30
AND s.stud_id = x.stud_id
AND y.club_id = 50
AND s.stud_id = y.stud_id;
It's like number 6) by @sean , just cleaner, I guess.
-- EXPLAIN ANALYZE
WITH two AS (
SELECT c0.student_id
FROM tmp.student_club c0
, tmp.student_club c1
WHERE c0.student_id = c1.student_id
AND c0.club_id = 30
AND c1.club_id = 50
)
SELECT st.* FROM tmp.student st
JOIN two ON (two.student_id=st.id)
;
The query plan:
Hash Join (cost=1904.76..1919.09 rows=337 width=15) (actual time=6.937..8.771 rows=324 loops=1)
Hash Cond: (two.student_id = st.id)
CTE two
-> Hash Join (cost=849.97..1645.76 rows=337 width=4) (actual time=4.932..6.488 rows=324 loops=1)
Hash Cond: (c1.student_id = c0.student_id)
-> Bitmap Heap Scan on student_club c1 (cost=32.76..796.94 rows=1614 width=4) (actual time=0.667..1.835 rows=1646 loops=1)
Recheck Cond: (club_id = 50)
-> Bitmap Index Scan on sc_club_id_idx (cost=0.00..32.36 rows=1614 width=0) (actual time=0.473..0.473 rows=1646 loops=1)
Index Cond: (club_id = 50)
-> Hash (cost=797.00..797.00 rows=1617 width=4) (actual time=4.203..4.203 rows=1620 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 57kB
-> Bitmap Heap Scan on student_club c0 (cost=32.79..797.00 rows=1617 width=4) (actual time=0.663..3.596 rows=1620 loops=1)
Recheck Cond: (club_id = 30)
-> Bitmap Index Scan on sc_club_id_idx (cost=0.00..32.38 rows=1617 width=0) (actual time=0.469..0.469 rows=1620 loops=1)
Index Cond: (club_id = 30)
-> CTE Scan on two (cost=0.00..6.74 rows=337 width=4) (actual time=4.935..6.591 rows=324 loops=1)
-> Hash (cost=159.00..159.00 rows=8000 width=15) (actual time=1.979..1.979 rows=8000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 374kB
-> Seq Scan on student st (cost=0.00..159.00 rows=8000 width=15) (actual time=0.093..0.759 rows=8000 loops=1)
Total runtime: 8.989 ms
(20 rows)
So it still seems to want the seq scan on student.
SELECT s.*
FROM student s
INNER JOIN student_club sc_soccer ON s.id = sc_soccer.student_id
INNER JOIN student_club sc_baseball ON s.id = sc_baseball.student_id
WHERE
sc_baseball.club_id = 50 AND
sc_soccer.club_id = 30