Question
The Scenario
Let's suppose we have a set of database tables that represent four key concepts:
- Entity Types (e.g. account, client, etc.)
- Entities (e.g. instances of the above Entity Types)
- Cohorts (a named group)
- Cohort Members (the Entities that make up the membership of a Cohort)
The rules around Cohorts are:
- A Cohort always has at least one Cohort Member.
- A Cohort's Members must be unique to that Cohort (i.e. Entity 5 cannot be a member of Cohort 3 twice, though it could be a member of both Cohort 3 and Cohort 4).
- No two Cohorts will ever be entirely equal in membership, though one Cohort may legitimately be a subset of another Cohort.
The rules around Entities are:
- No two Entities may have the same value pair (business_key, entity_type_id).
- Two Entities with a different entity_type_id may share a business_key.
Because pictures tell a thousand lines of code, here is the ERD:
The Question
I want a SQL query that, when provided a collection of (business_key, entity_type_id) pairs, will search for a Cohort that matches exactly, returning one row with just the cohort_id if that Cohort exists, and zero rows otherwise.
i.e. if the set of Entities matches entity_ids 1 and 2, it will only return a cohort_id where the cohort_members are exactly 1 and 2: not just 1, not just 2, and not a cohort with entity_ids 1, 2 and 3. If no cohort exists that satisfies this, then zero rows are returned.
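To make the matching rule concrete, here is a minimal Python sketch of the intended semantics. It is not a SQL solution, only a reference for what "exact match" means. The function name find_exact_cohort and the sample memberships are hypothetical and chosen purely for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# A (business_key, entity_type_id) pair identifies one Entity.
Pair = Tuple[str, int]


def find_exact_cohort(
    cohorts: Dict[int, FrozenSet[Pair]], wanted: Iterable[Pair]
) -> Optional[int]:
    """Return the cohort_id whose members equal `wanted` exactly, else None."""
    target = frozenset(wanted)
    for cohort_id, members in cohorts.items():
        # Set equality rejects both subsets and supersets of the target.
        if members == target:
            return cohort_id
    return None


# Hypothetical sample data (not taken from the fiddle).
cohorts = {
    1: frozenset({("acc1", 1), ("acc2", 1)}),
    3: frozenset({("cli1", 2)}),
    4: frozenset({("acc1", 1), ("acc2", 1), ("cli1", 2), ("cli2", 2)}),
}
```

Because no two Cohorts may have identical membership, at most one cohort_id can ever satisfy the equality check.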
The Test Cases
To help people addressing the question, I have created a fiddle of the tables along with some data that defines various Entity Types, Entities, and Cohorts. There is also a table with test data for matching, named test_cohort. It contains 6 test cohorts which test various scenarios. The first 5 tests should each exactly match just one cohort. The 6th test is a bogus one to test the zero-row clause. When using the test table, the associated INSERT statement should have just one line uncommented (see the fiddle; it is set up like that initially):
http://sqlfiddle.com/#!18/2d022
My attempt in SQL is the following, though it fails tests #2 and #4 (which can be found in the fiddle):
SELECT actual_cohort_member.cohort_id
FROM test_cohort
INNER JOIN entity
ON entity.business_key = test_cohort.business_key
AND entity.entity_type_id = test_cohort.entity_type_id
INNER JOIN cohort_member AS existing_potential_member
ON existing_potential_member.entity_id = entity.entity_id
INNER JOIN cohort
ON cohort.cohort_id = existing_potential_member.cohort_id
RIGHT OUTER JOIN cohort_member AS actual_cohort_member
ON actual_cohort_member.cohort_id = cohort.cohort_id
AND actual_cohort_member.cohort_id = existing_potential_member.cohort_id
AND actual_cohort_member.entity_id = existing_potential_member.entity_id
GROUP BY actual_cohort_member.cohort_id
HAVING
SUM(CASE WHEN
actual_cohort_member.cohort_id = existing_potential_member.cohort_id AND
actual_cohort_member.entity_id = existing_potential_member.entity_id THEN 1 ELSE 0
END) = COUNT(*)
;
Answer 1:
This can be achieved with a compound condition in the WHERE clause, since you are comparing against value pairs. You then check two counts per cohort_id against the number of pairs: the number of rows that satisfy the WHERE clause, and the cohort's total number of members.
SELECT c.cohort_id
FROM cohort c
INNER JOIN cohort_member cm
ON c.cohort_id = cm.cohort_id
INNER JOIN entity e
ON cm.entity_id = e.entity_id
WHERE (e.entity_type_id = 1 AND e.business_key = 'acc1') -- one condition per pair being searched for
OR (e.entity_type_id = 1 AND e.business_key = 'acc2')
GROUP BY c.cohort_id
HAVING COUNT(*) = 2 -- matched members: must equal the number of pairs in the WHERE clause
AND (SELECT COUNT(*)
FROM cohort_member cm2
WHERE cm2.cohort_id = c.cohort_id) = 2 -- total members of the cohort: must also equal the number of pairs
- Test Case #1
- Test Case #2
- Test Case #3
- Test Case #4
- Test Case #5
- Test Case #6
As you can see in the test cases above, the value used in the HAVING filter depends on the number of conditions in the WHERE clause. It would be advisable to build this query dynamically.
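One way to build such a dynamic query is sketched below in Python with the standard sqlite3 module. This is only an illustration under assumptions: the helper names make_demo_db and find_cohort are hypothetical, the schema is a cut-down guess at the one described in the question, and the sample rows are invented rather than copied from the fiddle. The same pattern (one parameterised condition per pair, plus the pair count bound twice) should carry over to SQL Server client libraries, though placeholder syntax differs.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (assumed schema and rows, not the fiddle's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (
            entity_id INTEGER PRIMARY KEY,
            business_key TEXT NOT NULL,
            entity_type_id INTEGER NOT NULL,
            UNIQUE (business_key, entity_type_id)
        );
        CREATE TABLE cohort_member (
            cohort_id INTEGER NOT NULL,
            entity_id INTEGER NOT NULL,
            PRIMARY KEY (cohort_id, entity_id)
        );
        INSERT INTO entity VALUES
            (1, 'acc1', 1), (2, 'acc2', 1), (3, 'cli1', 2), (4, 'cli2', 2);
        INSERT INTO cohort_member VALUES
            (1, 1), (1, 2), (2, 3), (2, 4), (3, 3),
            (4, 1), (4, 2), (4, 3), (4, 4), (5, 1), (5, 4);
    """)
    return conn


def find_cohort(conn, pairs):
    """Return the cohort_id matching `pairs` exactly, or None if there is none."""
    pairs = sorted(set(pairs))  # duplicates would inflate the expected count
    if not pairs:
        return None
    # One "(key = ? AND type = ?)" condition per pair, joined with OR.
    where = " OR ".join(
        "(e.business_key = ? AND e.entity_type_id = ?)" for _ in pairs
    )
    sql = f"""
        SELECT cm.cohort_id
        FROM cohort_member cm
        JOIN entity e ON e.entity_id = cm.entity_id
        WHERE {where}
        GROUP BY cm.cohort_id
        HAVING COUNT(*) = ?
           AND (SELECT COUNT(*)
                FROM cohort_member cm2
                WHERE cm2.cohort_id = cm.cohort_id) = ?
    """
    # Values are bound as parameters; only the placeholder text is generated.
    params = [value for pair in pairs for value in pair] + [len(pairs), len(pairs)]
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


conn = make_demo_db()
```

Only the number of placeholders is interpolated into the SQL text, so the searched values never become part of the statement itself.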
UPDATE
If the table test_cohort contains only one scenario, then the query below will meet your requirement. However, if test_cohort contains a list of scenarios, you might want to look at the other answer, since this solution does not alter any table schema.
SELECT c.cohort_id
FROM cohort c
INNER JOIN cohort_member cm
ON c.cohort_id = cm.cohort_id
INNER JOIN entity e
ON cm.entity_id = e.entity_id
INNER JOIN test_cohort tc
ON tc.business_key = e.business_key
AND tc.entity_type_id = e.entity_type_id
GROUP BY c.cohort_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM test_cohort)
AND (SELECT COUNT(*)
FROM cohort_member cm2
WHERE cm2.cohort_id = c.cohort_id) = (SELECT COUNT(*) FROM test_cohort)
- Test Case #1
- Test Case #2
- Test Case #3
- Test Case #4
- Test Case #5
- Test Case #6
Answer 2:
I have added a column i to your test_cohort table so that you can test all your scenarios at the same time. Here is the DDL:
CREATE TABLE test_cohort (
i int,
business_key NVARCHAR(255),
entity_type_id INT
);
INSERT INTO test_cohort VALUES
(1, 'acc1', 1), (1, 'acc2', 1) -- TEST #1: should match against cohort 1
,(2, 'cli1', 2), (2, 'cli2', 2) -- TEST #2: should match against cohort 2
,(3, 'cli1', 2) -- TEST #3: should match against cohort 3
,(4, 'acc1', 1), (4, 'acc2', 1), (4, 'cli1', 2), (4, 'cli2', 2) -- TEST #4: should match against cohort 4
,(5, 'acc1', 1), (5, 'cli2', 2) -- TEST #5: should match against cohort 5
,(6, 'acc1', 3), (6, 'cli2', 3) -- TEST #6: should not match any cohort
And the query:
select
c.i, m.cohort_id
from
(
select
*, cnt = count(*) over (partition by i)
from
test_cohort
) c
join entity e on c.entity_type_id = e.entity_type_id and c.business_key = e.business_key
join (
select
*, cnt = count(*) over (partition by cohort_id)
from
cohort_member
) m on e.entity_id = m.entity_id and c.cnt = m.cnt
group by m.cohort_id, c.cnt, c.i
having count(*) = c.cnt
Output
i cohort_id
------------
1 1
2 2
3 3
4 4
5 5
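The idea is to count the rows on each side before the join (pairs per test scenario, members per cohort), join only cohorts whose size equals the scenario's size, and then require that the number of matched rows equals that size.
Source: https://stackoverflow.com/questions/48699160/find-id-of-parent-where-all-children-exactly-match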
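The count-before-join idea can be tried out with the following Python sketch using the standard sqlite3 module. It is an adaptation rather than the query above: the sizes are computed with GROUP BY in common table expressions instead of window functions, so that it also runs on older SQLite builds. The function names build_db and match_all are hypothetical, and the entity and cohort_member rows are assumptions inferred from the comments in the INSERT statement above, not copied from the fiddle.

```python
import sqlite3


def build_db():
    """Create an in-memory database with assumed sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (entity_id INTEGER PRIMARY KEY,
                             business_key TEXT, entity_type_id INTEGER);
        CREATE TABLE cohort_member (cohort_id INTEGER, entity_id INTEGER);
        CREATE TABLE test_cohort (i INTEGER, business_key TEXT,
                                  entity_type_id INTEGER);
        INSERT INTO entity VALUES
            (1, 'acc1', 1), (2, 'acc2', 1), (3, 'cli1', 2), (4, 'cli2', 2);
        INSERT INTO cohort_member VALUES
            (1, 1), (1, 2), (2, 3), (2, 4), (3, 3),
            (4, 1), (4, 2), (4, 3), (4, 4), (5, 1), (5, 4);
        INSERT INTO test_cohort VALUES
            (1, 'acc1', 1), (1, 'acc2', 1),
            (2, 'cli1', 2), (2, 'cli2', 2),
            (3, 'cli1', 2),
            (4, 'acc1', 1), (4, 'acc2', 1), (4, 'cli1', 2), (4, 'cli2', 2),
            (5, 'acc1', 1), (5, 'cli2', 2),
            (6, 'acc1', 3), (6, 'cli2', 3);
    """)
    return conn


def match_all(conn):
    """Return (scenario, cohort_id) for every scenario with an exact match."""
    sql = """
        WITH test_size AS (      -- pairs per test scenario, counted before the join
            SELECT i, COUNT(*) AS cnt FROM test_cohort GROUP BY i
        ),
        cohort_size AS (         -- members per cohort, counted before the join
            SELECT cohort_id, COUNT(*) AS cnt FROM cohort_member GROUP BY cohort_id
        )
        SELECT t.i, cm.cohort_id
        FROM test_cohort t
        JOIN test_size ts ON ts.i = t.i
        JOIN entity e ON e.business_key = t.business_key
                     AND e.entity_type_id = t.entity_type_id
        JOIN cohort_member cm ON cm.entity_id = e.entity_id
        JOIN cohort_size cs ON cs.cohort_id = cm.cohort_id
                           AND cs.cnt = ts.cnt   -- only cohorts of the same size
        GROUP BY t.i, cm.cohort_id, ts.cnt
        HAVING COUNT(*) = ts.cnt                 -- every pair found in the cohort
        ORDER BY t.i
    """
    return conn.execute(sql).fetchall()


matches = match_all(build_db())
```

With these assumed rows, scenario 6 produces no row at all, because none of its pairs resolve to an entity.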