I think I found the answer to my question, I'm just unsure of the syntax; I keep getting SQL errors.
Basically, I want to do the opposite of IN. Take this example:
Ok, stating the problem again.
"Find users that have entries in the tags table for both tag1 and tag2". This means at least 2 rows in the child tags table for each user table entry
Solution 1: The intersection of "users with tag1" and "users with tag2"
SELECT u.*
FROM
users u INNER JOIN
(
SELECT user_id FROM tags WHERE name = 'tag1'
INTERSECT
SELECT user_id FROM tags WHERE name = 'tag2'
) t ON u.id = t.user_id
Solution 2: EXISTS
SELECT u.*
FROM
users u
WHERE
EXISTS (SELECT * FROM tags t1 WHERE t1.name = 'tag1'
AND u.id = t1.user_id)
AND
EXISTS (SELECT * FROM tags t2 WHERE t2.name = 'tag2'
AND u.id = t2.user_id)
Solution 3: JOIN
SELECT u.* FROM
users u
INNER JOIN
tags as t1 on t1.user_id = u.id
INNER JOIN
tags as t2 on t2.user_id = u.id
WHERE
t1.name='tag1' AND t2.name='tag2'
Solution 4: IN
SELECT u.*
FROM
users u
WHERE
u.id IN (SELECT t1.user_id FROM tags t1 WHERE t1.name = 'tag1')
AND
u.id IN (SELECT t2.user_id FROM tags t2 WHERE t2.name = 'tag2')
The EXISTS, INTERSECT and IN versions should all give the same execution plan in SQL Server.
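As a rough sanity check that the four formulations return the same users, here is a minimal sketch using Python's built-in sqlite3 module. The table contents (alice, bob, carol) are invented for illustration, and SQLite stands in for SQL Server, so this says nothing about execution plans, only about results.

```python
import sqlite3

# Hypothetical sample data: only alice (id 1) has both tag1 and tag2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES (1, 'tag1'), (1, 'tag2'), (2, 'tag1'), (3, 'tag2');
""")

solutions = {
    "intersect": """
        SELECT u.id FROM users u INNER JOIN (
            SELECT user_id FROM tags WHERE name = 'tag1'
            INTERSECT
            SELECT user_id FROM tags WHERE name = 'tag2'
        ) t ON u.id = t.user_id""",
    "exists": """
        SELECT u.id FROM users u
        WHERE EXISTS (SELECT * FROM tags t1 WHERE t1.name = 'tag1' AND u.id = t1.user_id)
          AND EXISTS (SELECT * FROM tags t2 WHERE t2.name = 'tag2' AND u.id = t2.user_id)""",
    "join": """
        SELECT u.id FROM users u
        INNER JOIN tags t1 ON t1.user_id = u.id
        INNER JOIN tags t2 ON t2.user_id = u.id
        WHERE t1.name = 'tag1' AND t2.name = 'tag2'""",
    "in": """
        SELECT u.id FROM users u
        WHERE u.id IN (SELECT t1.user_id FROM tags t1 WHERE t1.name = 'tag1')
          AND u.id IN (SELECT t2.user_id FROM tags t2 WHERE t2.name = 'tag2')""",
}

# Run every variant and collect the matching user ids.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in solutions.items()}
for name, rows in results.items():
    print(name, rows)
```

With this sample data every variant returns only user 1.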
Now, these are all for the case where you are looking for 2 tags. As you want more tags, they become cumbersome so use shahkalpesh's solution.
However, I'd modify it so the tags are in a table and no extra OR clauses are needed
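SELECT u.*
FROM
Users u
WHERE u.id IN
(
SELECT t.user_id
FROM tags t
JOIN @MyTags mt ON t.name = mt.name
GROUP BY t.user_id
-- every tag listed in @MyTags must be present for the user
HAVING COUNT(DISTINCT t.name) = (SELECT COUNT(DISTINCT name) FROM @MyTags)
)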
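A minimal sketch of the "tags in a table" idea, again with Python's sqlite3. SQLite has no table variables, so a temp table named my_tags is used as a stand-in for @MyTags; that name, the helper users_with_all and the sample rows are all assumptions made for the example. The point is that the query text stays the same however many tags are requested.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);
    CREATE TEMP TABLE my_tags (name TEXT PRIMARY KEY);  -- stand-in for @MyTags
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES
        (1, 'tag1'), (1, 'tag2'), (1, 'tag3'),
        (2, 'tag1'), (2, 'tag2'),
        (3, 'tag2');
""")

# The query never changes; only the contents of my_tags do.
QUERY = """
    SELECT u.id, u.user_name
    FROM users u
    JOIN tags t ON t.user_id = u.id
    JOIN my_tags mt ON mt.name = t.name
    GROUP BY u.id, u.user_name
    HAVING COUNT(DISTINCT t.name) = (SELECT COUNT(*) FROM my_tags)
    ORDER BY u.id
"""

def users_with_all(wanted):
    """Load the wanted tag names into my_tags, then run the fixed query."""
    conn.execute("DELETE FROM my_tags")
    conn.executemany("INSERT INTO my_tags VALUES (?)", [(w,) for w in wanted])
    return conn.execute(QUERY).fetchall()

two_tags = users_with_all(["tag1", "tag2"])
three_tags = users_with_all(["tag1", "tag2", "tag3"])
print(two_tags)
print(three_tags)
```

Asking for two tags matches alice and bob; asking for three matches alice only, with no extra clauses written.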
Let's talk about this problem in generalities first, then specifics.
In this problem what you want to do is select rows from table A depending on conditions in two (or for the general case, more than two) rows in table B. In order to accomplish this, you need to do one of two things:
(1) execute tests against different rows in table B
(2) aggregate the rows of interest from table B into a single row which somehow contains the information you need to test the original rows from table B
This kind of problem is the big reason why, I think, you see people creating comma-delimited lists in VARCHAR fields instead of normalizing their databases correctly.
In your example, you want to select rows from users based on the existence of rows matching two specific conditions in tags.
There are three ways you can use technique (1) (testing different rows). They are using EXISTS, using sub-queries, and using JOINs:
1A. Using EXISTS is (in my opinion, anyway) clear because it matches what you're trying to do: checking for the existence of rows. It scales moderately well to more tags in terms of writing the SQL: if you're generating dynamic SQL, you simply add an additional AND EXISTS clause for each tag (performance, of course, will suffer):
SELECT * FROM users WHERE
EXISTS (SELECT * FROM tags WHERE user_id = users.id AND name ='tag1') AND
EXISTS (SELECT * FROM tags WHERE user_id = users.id AND name ='tag2')
I think this clearly expresses the intent of the query.
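To make the "one extra AND EXISTS clause per tag" point concrete, here is a small sketch of generating that SQL dynamically in Python with sqlite3. The helper name build_exists_query and the sample rows are made up for the example; tag values are passed as bound parameters rather than pasted into the SQL string.

```python
import sqlite3

def build_exists_query(tag_count):
    """Return SQL with one AND EXISTS clause (and one ? placeholder) per tag."""
    if tag_count < 1:
        raise ValueError("need at least one tag")
    clause = "EXISTS (SELECT * FROM tags WHERE user_id = users.id AND name = ?)"
    return ("SELECT id, user_name FROM users WHERE "
            + " AND ".join([clause] * tag_count)
            + " ORDER BY id")

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES (1, 'tag1'), (1, 'tag2'), (2, 'tag1'), (3, 'tag2');
""")

wanted = ["tag1", "tag2"]
exists_sql = build_exists_query(len(wanted))
exists_rows = conn.execute(exists_sql, wanted).fetchall()
print(exists_sql)
print(exists_rows)
```

Each additional tag adds one more correlated sub-query, which is where the performance cost mentioned above comes from.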
1B. Using sub-queries is also pretty clear. Because this technique does not involve correlated sub-queries, some engines can optimize it better (it depends, in part, on the number of users with any given tag):
SELECT * FROM users WHERE
id IN (SELECT user_id FROM tags WHERE name ='tag1') AND
id IN (SELECT user_id FROM tags WHERE name ='tag2')
This scales the same way that option 1A does. It's also (to me, anyway) pretty clear.
1C. Using JOINs involves INNER JOINing the tags table to the users table once for each tag. It doesn't scale as well because it's harder (but still possible) to generate the dynamic SQL:
SELECT u.* FROM users u
INNER JOIN tags t1 ON u.id = t1.user_id
INNER JOIN tags t2 ON u.id = t2.user_id
WHERE t1.name = 'tag1' AND t2.name = 'tag2'
Personally, I feel this is considerably less clear than the other two options since it looks like the goal is to create a JOINed record set rather than filter the users table. Also, scalability suffers because you need to add INNER JOINs and change the WHERE clause. Note that this technique sort of straddles techniques 1 and 2 because it uses the JOINs to aggregate two rows from tags.
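For comparison, a sketch of what generating the JOIN version dynamically might look like, using Python and sqlite3. The helper build_join_query and the data are hypothetical. Note that both the JOIN list and the WHERE clause have to grow with each tag; DISTINCT is added here as a guard in case the same tag appears twice for a user, which the original query does not do.

```python
import sqlite3

def build_join_query(tag_count):
    """Return SQL that joins tags once per wanted tag (aliases t1, t2, ...)."""
    if tag_count < 1:
        raise ValueError("need at least one tag")
    aliases = [f"t{i}" for i in range(1, tag_count + 1)]
    joins = " ".join(f"INNER JOIN tags {a} ON u.id = {a}.user_id" for a in aliases)
    where = " AND ".join(f"{a}.name = ?" for a in aliases)
    return (f"SELECT DISTINCT u.id, u.user_name FROM users u {joins} "
            f"WHERE {where} ORDER BY u.id")

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES (1, 'tag1'), (1, 'tag2'), (2, 'tag1'), (3, 'tag2');
""")

wanted = ["tag1", "tag2"]
join_sql = build_join_query(len(wanted))
join_rows = conn.execute(join_sql, wanted).fetchall()
print(join_sql)
print(join_rows)
```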
There are two main ways of using technique (2) (aggregating the rows): using COUNTs and using string processing:
2A. Using COUNTs is much easier if your tags table is "protected" against having the same tag applied twice to the same user. You can do this by making (user_id, name) the PRIMARY KEY in tags, or by creating a UNIQUE INDEX on those two columns. If the rows are protected in that way you can do this:
SELECT users.id, users.user_name
FROM users INNER JOIN tags ON users.id = tags.user_id
WHERE tags.name IN ('tag1', 'tag2')
GROUP BY users.id, users.user_name
HAVING COUNT(*) = 2
In this case you match the HAVING COUNT(*) = test value against the number of tag names in the IN clause. This does not work if each tag can be applied to a user more than once, because a count of 2 could be produced by two instances of 'tag1' and none of 'tag2' (and the row would qualify when it shouldn't), or two instances of 'tag1' plus one instance of 'tag2' would produce a count of 3 (and the user would not qualify even though they should).
Note that this is the most scalable technique performance-wise since you can add additional tags and no additional queries or JOINs are needed.
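Here is a small sketch of the "protected" setup, using Python's sqlite3: a composite primary key on (user_id, name) rejects a duplicate tag, and the COUNT query handles any number of tags by changing only the IN list and the count. The helper users_with_all and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    -- The composite key prevents the same tag being applied twice to one user.
    CREATE TABLE tags (user_id INTEGER, name TEXT, PRIMARY KEY (user_id, name));
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES
        (1, 'tag1'), (1, 'tag2'), (1, 'tag3'),
        (2, 'tag1'), (2, 'tag2'),
        (3, 'tag2');
""")

# Trying to tag alice with tag1 a second time violates the key.
try:
    conn.execute("INSERT INTO tags VALUES (1, 'tag1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

def users_with_all(wanted):
    """Users holding every tag in wanted; relies on the uniqueness constraint."""
    wanted = sorted(set(wanted))  # the count must be over distinct tag names
    placeholders = ", ".join("?" * len(wanted))
    sql = f"""
        SELECT users.id, users.user_name
        FROM users INNER JOIN tags ON users.id = tags.user_id
        WHERE tags.name IN ({placeholders})
        GROUP BY users.id, users.user_name
        HAVING COUNT(*) = ?
        ORDER BY users.id
    """
    return conn.execute(sql, [*wanted, len(wanted)]).fetchall()

two_tag_users = users_with_all(["tag1", "tag2"])
three_tag_users = users_with_all(["tag1", "tag2", "tag3"])
print(duplicate_rejected, two_tag_users, three_tag_users)
```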
If duplicate tags are allowed you can perform an inner aggregation to remove the duplicates. You can do this in the same query I showed above, but for simplicity's sake I'm going to break the logic out into a separate view:
CREATE VIEW tags_dedup (user_id, name) AS
SELECT DISTINCT user_id, name FROM tags
and then you go back to the query above and substitute tags_dedup for tags.
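The failure mode and the view-based fix can be reproduced with a short sketch in Python's sqlite3. The sample rows are invented: bob has 'tag1' twice and no 'tag2', carol has 'tag1' twice plus 'tag2'. The view is created without an explicit column list here, only to stay compatible with older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);  -- duplicates allowed
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES
        (1, 'tag1'), (1, 'tag2'),
        (2, 'tag1'), (2, 'tag1'),
        (3, 'tag1'), (3, 'tag1'), (3, 'tag2');
    CREATE VIEW tags_dedup AS
        SELECT DISTINCT user_id, name FROM tags;
""")

# Same COUNT query, run once against the raw table and once against the view.
COUNT_QUERY = """
    SELECT users.id, users.user_name
    FROM users INNER JOIN {source} AS t ON users.id = t.user_id
    WHERE t.name IN ('tag1', 'tag2')
    GROUP BY users.id, users.user_name
    HAVING COUNT(*) = 2
    ORDER BY users.id
"""

naive = conn.execute(COUNT_QUERY.format(source="tags")).fetchall()
deduped = conn.execute(COUNT_QUERY.format(source="tags_dedup")).fetchall()
print("raw table:", naive)     # bob wrongly included, carol wrongly excluded
print("dedup view:", deduped)  # alice and carol, the users with both tags
```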
2B. Using string processing is database-specific because there is no standard SQL aggregate function to produce string lists from multiple rows. Some databases, however, offer extensions to do this. In MySQL, you can use GROUP_CONCAT and FIND_IN_SET to do this:
SELECT users.id, users.user_name, GROUP_CONCAT(tags.name) as all_tags
FROM users INNER JOIN tags ON users.id = tags.user_id
GROUP BY users.id, users.user_name
HAVING FIND_IN_SET('tag1', all_tags) > 0 AND
FIND_IN_SET('tag2', all_tags) > 0
Note, this is very inefficient and uses MySQL-specific extensions.
Try the following:
SELECT *
FROM users u, tags t1, tags t2
WHERE t1.user_id = t2.user_id
AND t1.name = 'tag1'
AND t2.name = 'tag2'
AND t1.user_id = u.id
Obviously, for a large number of tags, the performance of this query will be severely degraded.
What about
SELECT users.* FROM users, tags WHERE tags.user_id = users.id AND tags.name = 'tag1'
INTERSECT
SELECT users.* FROM users, tags WHERE tags.user_id = users.id AND tags.name = 'tag2'
Try
WHERE users.id IN (SELECT user_id FROM tags WHERE name IN ('tag1'))
AND users.id IN (SELECT user_id FROM tags WHERE name IN ('tag2'));
Not super efficient, but probably one of many ways.
I would do exactly what you are doing first, because that gets a list of all users with 'tag1' and a list of all users with 'tag2', but in the same response obviously. So, we have to add some more:
Do a GROUP BY users (or users.id) and then HAVING COUNT(*) = 2. That will group duplicate users (which means those with both tag1 and tag2) and then the HAVING part will remove the ones with just one of the two tags.
This solution avoids adding yet another join-statement, but honestly I'm not sure which is faster. People, feel free to comment on the performance-part :)
EDIT: Just to make it easier to try out, here's the whole thing:
SELECT users.*
FROM users INNER JOIN
tags ON tags.user_id = users.id
WHERE tags.name = 'tag1' OR tags.name = 'tag2'
GROUP BY users.id
HAVING COUNT(*) = 2
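To see what the GROUP BY and HAVING steps are doing, here is a rough sketch in Python's sqlite3 that prints the intermediate results. The sample rows are invented, and it assumes a tag is never applied twice to the same user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE tags (user_id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tags VALUES (1, 'tag1'), (1, 'tag2'), (2, 'tag1'), (3, 'tag2');
""")

BASE = """
    FROM users INNER JOIN tags ON tags.user_id = users.id
    WHERE tags.name = 'tag1' OR tags.name = 'tag2'
"""

# Step 1: the plain join yields one row per (user, matching tag).
joined = conn.execute(
    "SELECT users.id, tags.name" + BASE + "ORDER BY users.id, tags.name"
).fetchall()

# Step 2: grouping collapses those rows to one per user, with a count.
counts = conn.execute(
    "SELECT users.id, COUNT(*)" + BASE + "GROUP BY users.id ORDER BY users.id"
).fetchall()

# Step 3: HAVING keeps only the users who matched both tags.
both = conn.execute(
    "SELECT users.id, users.user_name" + BASE
    + "GROUP BY users.id, users.user_name HAVING COUNT(*) = 2 ORDER BY users.id"
).fetchall()

print(joined)
print(counts)
print(both)
```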