问题
Using Postgres I have a schema that has conversations
and conversationUsers
. Each conversation
has many conversationUsers
. I want to be able to find the conversation that has the exactly specified number of conversationUsers
. In other words, provided an array of userIds
(say, [1, 4, 6]
) I want to be able to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId"
5).
回答1:
This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming is the PK of table "conversationUsers"
which enforces uniqueness of combinations, NOT NULL
and also provides the index essential for performance implicitly. Columns of the multicolumn PK in this order! Else you have to do more.
About the order of index columns:
- Is a composite index also good for queries on the first field?
For the basic query, there is the "brute force" approach to count the number of matching users for all conversations of all given users and then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS
anti-semi-join. More:
- How do I (or can I) SELECT DISTINCT on multiple columns?
Alternative techniques:
- Select rows which are not present in other table
There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.
- How to filter SQL results in a has-many-through relation
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
SELECT "conversationId", 1 AS idx
FROM "conversationUsers"
WHERE "userId" = ('{1,4,6}'::int[])[1]
UNION ALL
SELECT c."conversationId", r.idx + 1
FROM rcte r
JOIN "conversationUsers" c USING ("conversationId")
WHERE c."userId" = ('{1,4,6}'::int[])[idx + 1]
)
SELECT "conversationId"
FROM rcte r
WHERE idx = array_length(('{1,4,6}'::int[]), 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = r."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
SELECT "conversationId", 1 AS idx
FROM "conversationUsers"
WHERE "userId" = $1[1]
UNION ALL
SELECT c."conversationId", r.idx + 1
FROM rcte r
JOIN "conversationUsers" c USING ("conversationId")
WHERE c."userId" = $1[idx + 1]
)
SELECT "conversationId"
FROM rcte r
WHERE idx = array_length($1, 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = r."conversationId"
AND "userId" <> ALL($1);
Call:
EXECUTE conversations('{1,4,6}');
db<>fiddle here (also demonstrating a function)
There is still room for improvement: to get top performance you have to put users with the fewest conversations first in your input array to eliminate as many rows as possible early. To get top performance you can generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
More explanation:
- Using same column multiple times in WHERE clause
Alternative: MV for sparsely written table
If the table "conversationUsers"
is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users -- sorted array
FROM (
SELECT "conversationId", "userId"
FROM "conversationUsers"
ORDER BY 1, 2
) sub
GROUP BY 1
ORDER BY 1;
CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
- https://dba.stackexchange.com/a/207938/3684
About sorting rows in a subquery:
- How to apply ORDER BY and LIMIT in combination with an aggregate function?
In older versions use a plain multicolumn index on (users, "conversationId")
. With very long arrays, a hash index might make sense in Postgres 10 or later.
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
db<>fiddle here
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
Aside: consider legal identifiers without double quotes. conversation_id
instead of "conversationId"
etc.:
- Are PostgreSQL column names case-sensitive?
回答2:
you can modify your query like this and it should work:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
SELECT DISTINCT c1."conversationId"
FROM "conversationUsers" c1
WHERE c1."userId" IN (1, 4)
)
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;
回答3:
This might be easier to follow. You want the conversation ID, group by it. add the HAVING clause based on the sum of matching user IDs count equal to the all possible within the group. This will work, but will be longer to process because of no pre-qualifier.
select
cu.ConversationId
from
conversationUsers cu
group by
cu.ConversationID
having
sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
To Simplify the list even more, have a pre-query of conversations that at least one person is in... If they are not in to begin with, why bother considering such other conversations.
select
cu.ConversationId
from
( select cu2.ConversationID
from conversationUsers cu2
where cu2.userID = 4 ) preQual
JOIN conversationUsers cu
preQual.ConversationId = cu.ConversationId
group by
cu.ConversationID
having
sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
来源:https://stackoverflow.com/questions/53235353/sql-query-to-find-a-row-with-a-specific-number-of-associations