问题
I have a simple question about the most efficient way to perform a particular join.
Take these three tables, real names have been changed to protect the innocent:
Table: animal
animal_id name ... ====================== 1 bunny 2 bear 3 cat 4 mouse
Table: tags
tag_id tag ================== 1 fluffy 2 brown 3 cute 4 small
Mapping Table: animal_tag
animal_id tag_id ================== 1 1 1 2 1 3 2 2 3 4 4 2
I want to find all animals that are tagged as 'fluffy', 'brown', and 'cute'. That is to say that the animal must be tagged with all three. In reality, the number of required tags can vary, but should be irrelevant for this discussion. This is the query I came up with:
SELECT * FROM animal
JOIN (
SELECT at.animal_id FROM animal_tag at
WHERE at.tag_id IN (
SELECT tg.tag_id FROM tag tg
WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute'
)
GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3
) AS jt
ON animal.animal_id=jt.animal_id
On a table with thousands 'animals' and and hundreds of 'tags', this query performs respectably ... 10s of milliseconds. However, when i look at the query plan (Apache Derby is the DB), the optimizer's estimated cost is pretty high (9945.12) and the plan pretty extensive. For a query this "simple" I usually try to get query plans with an estimated cost of single or double digits.
So my question is, is there a better way to perform this query? Seems like a simple query, but I've been stumped coming up with anything better.
回答1:
You could create a temp table using DECLARE GLOBAL TEMPORARY TABLE And then do an INNER JOIN to eliminate the "WHERE IN". Working with Joins which are set based is usually far more efficient than Where statements that have to be evaluated for each row.
回答2:
try this:
SELECT DISTINCT f.Animal_ID, g.Name
FROM Animal f INNER JOIN
(SELECT a.Animal_ID, a.Name, COUNT(*) as iCount
FROM Animal a INNER JOIN Animal_Tag b
ON a.Animal_ID = b.animal_ID
INNER JOIN Tags c
On b.tag_ID = c.tag_ID
WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
GROUP BY a.Animal_ID) g
WHERE g.iCount = 3 -- No. of tags
UPDATE
SELECT DISTINCT a.Animal_ID, a.Name, COUNT(*) as iCount
FROM Animal a INNER JOIN Animal_Tag b
ON a.Animal_ID = b.animal_ID
INNER JOIN Tags c
On b.tag_ID = c.tag_ID
WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
GROUP BY Animal_ID
HAVING iCount = 3 -- No. of tags
回答3:
Give this a spin:
SELECT a.*
FROM animal a
INNER JOIN
(
SELECT at.animal_id
FROM tag t
INNER JOIN animal_tag at ON at.tag_id = t.tag_id
WHERE tag IN ('fluffy', 'brown', 'cute')
GROUP BY at.animal_id
HAVING count(*) = 3
) f ON a.animal_id = f.animal_id
Here is another option, just for the fun of it:
SELECT a.*
FROM animal a
INNER JOIN animal_tag at1 on at1.animal_id = a.animal_id
INNER JOIN tag t1 on t1.tag_id = at1.tag_id
INNER JOIN animal_tag at2 on at2.animal_id = a.animal_id
INNER JOIN tag t2 on t2.tag_id = at2.tag_id
INNER JOIN animal_tag at3 on at3.animal_id = a.animal_id
INNER JOIN tag t3 on t3.tag_id = at3.tag_id
WHERE t1.tag = 'fluffy' AND t2.tag = 'brown' AND t3.tag = 'cute'
I don't really expect this last option to do well... the other options avoid needing to go back to the tag table multiple times to resolve a tag name from the id... but you never know what the query optimizer will do until you try it.
回答4:
First of all, a huge thanks to everyone who jumped in on this. Ultimately the answer is, as referenced by several commenters, relational division.
While I did take a course in Codd's relational data model many moons ago, the course like many, did not really cover relational division. Unwittingly, my original query is actually an application of Relational Division.
Referring to a slide 26-27 in this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods mentioned for applying relational division but, at least in my case, the counting method provides the fastest run-time. I encourage anyone interested in this problem to read the aforementioned slide stack, as well as the article referenced on this page by Mikael Eriksson. Again, thanks to everyone.
回答5:
I was wondering how bad it would be to use a relational division there. Can you please give it a run? I know this will take more, but I'm intrigued by how much :) If you can provide both the estimated cost and the time, it would be great.
select a2.animal_id, a2.animal_name from animal2 a2
where not exists (
select * from animal1 a1, tags t1
where not exists (
select * from animal_tag at1
where at1.animal_id = a1.animal_id and at1.animal_tag = t1.tag_id
) and a2.animal_id = a1.animal_id and t1.tag in ('fluffy', 'brown', 'cute')
)
Now looking for a fast query, I can't think in any faster than john's or yours. Actually john's might be a little slower than yours because he's performing unnencesary operations (remove distinct and remove count(*) from select):
SELECT a.Animal_ID, a.Name FROM Animal a
INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID
INNER JOIN Tags c On b.tag_ID = c.tag_ID
WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
GROUP BY Animal_ID, a.Name
HAVING count(*) = 3 -- No. of tags
This should be as fast as yours.
PS: Is there any way to remove that damned 3 without duplicating the where clause? My brain is boiling :)
来源:https://stackoverflow.com/questions/9170101/join-between-mapping-junction-table-with-specific-cardinality