Question
I have the following tables:
- work_units: self-explanatory
- workers: self-explanatory
- skills: every work unit requires a number of skills if you want to work on it. Every worker is proficient in a number of skills.
- work_units_skills: join table
- workers_skills: join table
A worker can request the next appropriate free highest priority (whatever that means) unit of work to be assigned to her.
Currently I have:
SELECT work_units.*
FROM work_units
-- some joins
WHERE NOT EXISTS (
SELECT skill_id
FROM work_units_skills
WHERE work_unit_id = work_units.id
EXCEPT
SELECT skill_id
FROM workers_skills
WHERE worker_id = 1 -- the worker id that made the request
)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;
This condition makes the query 8-10 times slower though.
Is there a better way to express that a work unit's skills should be a subset of the worker's skills, or something to improve the current query?
Some more context:
- The skills table is fairly small.
- Both work_units and workers tend to have very few associated skills.
- work_units_skills has an index on work_unit_id.
- I tried moving the query on workers_skills into a CTE. This gave a slight improvement (10-15%), but it's still too slow.
- A work unit with no skill can be picked up by any user. Aka an empty set is a subset of every set.
Answer 1:
One simple speed-up would be to use EXCEPT ALL instead of EXCEPT. The latter removes duplicates, which is unnecessary here and can be slow.
An alternative that would probably be faster is to use a further NOT EXISTS instead of the EXCEPT:
...
WHERE NOT EXISTS (
SELECT skill_id
FROM work_units_skills wus
WHERE work_unit_id = work_units.id
AND NOT EXISTS (
SELECT skill_id
FROM workers_skills ws
WHERE worker_id = 1 -- the worker id that made the request
AND ws.skill_id = wus.skill_id
)
)
Demo: http://rextester.com/AGEIS52439 (with the LIMIT removed for testing)
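To see the double NOT EXISTS behave as a subset test, here is a minimal sketch that runs the same shape of query from Python. The sample rows and the eligible_units helper are made up for illustration, and SQLite (via the standard sqlite3 module) is only a stand-in for PostgreSQL, so it says nothing about timings.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units (id INTEGER PRIMARY KEY);
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER);
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER);

    INSERT INTO work_units VALUES (1), (2), (3);
    -- unit 1 needs skills 10 and 20, unit 2 needs 10 and 30, unit 3 needs none
    INSERT INTO work_units_skills VALUES (1, 10), (1, 20), (2, 10), (2, 30);
    -- worker 1 has skills 10 and 20; worker 2 has no skills at all
    INSERT INTO workers_skills VALUES (1, 10), (1, 20);
""")

# "There is no required skill that the worker is missing."
SUBSET_QUERY = """
    SELECT wu.id
    FROM work_units wu
    WHERE NOT EXISTS (
        SELECT 1
        FROM work_units_skills wus
        WHERE wus.work_unit_id = wu.id
          AND NOT EXISTS (
              SELECT 1
              FROM workers_skills ws
              WHERE ws.worker_id = ?
                AND ws.skill_id = wus.skill_id
          )
    )
    ORDER BY wu.id
"""

def eligible_units(worker_id):
    """Return ids of work units whose required skills the worker fully covers."""
    return [row[0] for row in conn.execute(SUBSET_QUERY, (worker_id,))]

print(eligible_units(1))  # [1, 3]
print(eligible_units(2))  # [3]
```

Unit 3 has no required skills, so it is returned for every worker, which matches the "empty set is a subset of every set" requirement from the question.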
Answer 2:
(see UPDATE below)
This query finds a good work_unit using a simple LEFT JOIN to find a missing skill in the shorter table of skills the requesting worker has. The trick is that whenever there is a missing skill, there will be a NULL value in the join; this is translated to a 1, and the work_unit is removed by keeping only the ones with all 0 values, i.e. having a max of 0.
Being classic SQL, this is the kind of query that engines tend to target most heavily for optimization:
SELECT work_unit_id
FROM
work_units_skills s
LEFT JOIN
(SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY work_unit_id
HAVING max(CASE WHEN t.skill_id IS NULL THEN 1 ELSE 0 END)=0
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE
In order to catch work_units with no skills, we throw the work_units table into the JOIN:
SELECT r.id AS work_unit_id
FROM
work_units r
LEFT JOIN
work_units_skills s ON (r.id=s.work_unit_id)
LEFT JOIN
(SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY r.id
HAVING bool_or(s.skill_id IS NULL) OR bool_and(t.skill_id IS NOT NULL)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;
Answer 3:
You may use the following query:
SELECT wu.*
FROM work_units wu
LEFT JOIN work_units_skills wus ON wus.work_unit_id = wu.id and wus.skill_id IN (
SELECT id
FROM skills
EXCEPT
SELECT skill_id
FROM workers_skills
WHERE worker_id = 1 -- the worker id that made the request
)
WHERE wus.work_unit_id IS NULL;
demo (thanks, Steve Chambers for most of the data)
You should definitely have indexes on work_units_skills(skill_id), workers_skills(worker_id) and work_units(id).
If you want to speed it up even more, create indexes on work_units_skills(skill_id, work_unit_id) and workers_skills(worker_id, skill_id), which avoid accessing those tables.
The subquery is independent, and the outer join should be relatively fast if the result is not large.
Answer 4:
Bit-Mask Solution
Without any changes to your existing database design, just add 2 fields.
First: a long or bigint (depending on your DBMS) in Workers.
Second: another long or bigint in Work_Units.
These fields encode the skills of each work unit and the skills of each worker. For example, suppose that you have 8 records in the Skills table
(note that the Skills table is small):
1- some skill 1
2- some skill 2
...
8- some skill 8
Then, to assign skills 1, 3, 6 and 7 to a work unit, just use the number 01100101.
(I suggest placing the bits in reverse order, with skill 1 in the rightmost bit, so that additional skills can be supported in the future.)
In practice you can store the base-10 number in the database (101 instead of 01100101).
A similar number can be generated for workers. Each worker chooses some skills, so we can turn the selected items into a number and save it in the additional field in the Workers table.
Finally, to find the appropriate subset of work_units for any worker, just select from work_units and use a bitwise AND, like below.
A: new_field_of_specific_worker, which holds the skills of the worker we are currently searching work_units for.
B: new_field_of_work_units, which holds the skills of each work_unit.
select * from work_units
where A & B = B
Notice:
1: This is likely the fastest approach, but it has some difficulties.
2: There is extra work whenever a skill is added or deleted. But this is a trade-off: adding or deleting skills happens less often.
3: We should still keep skills, work_units_skills and workers_skills too. But for searching, we just use the new fields.
Also, this approach can be used for tag management systems like Stack Overflow's tags.
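The encoding and the A & B = B test can be checked outside the database. Below is a small sketch in Python; the helper names skills_to_mask and can_work_on are made up for illustration, and skill ids are assumed to be the 1-based numbers from the example above.

```python
def skills_to_mask(skill_ids):
    """Pack 1-based skill ids into an integer: skill n sets bit n-1."""
    mask = 0
    for skill_id in skill_ids:
        mask |= 1 << (skill_id - 1)
    return mask

def can_work_on(worker_mask, unit_mask):
    """True when every bit set in unit_mask is also set in worker_mask."""
    return (worker_mask & unit_mask) == unit_mask

unit_mask = skills_to_mask([1, 3, 6, 7])
print(unit_mask, format(unit_mask, "08b"))  # 101 01100101

# A worker with a superset of the required skills qualifies.
print(can_work_on(skills_to_mask([1, 2, 3, 6, 7]), unit_mask))  # True
# A worker missing skill 7 does not.
print(can_work_on(skills_to_mask([1, 3, 6]), unit_mask))  # False
# A work unit with no skills (mask 0) matches any worker.
print(can_work_on(skills_to_mask([]), 0))  # True
```

With a 64-bit bigint this layout caps out at 64 distinct skills, which is one of the difficulties mentioned in the notes.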
Answer 5:
With the current info I can only answer on a hunch. Try removing the EXCEPT-statement and see if it gets significantly faster. If it does, you can add that part again, but using WHERE-conditions. In my experience set operators (MINUS/EXCEPT, UNION, INTERSECT) are quite the performance killers.
Answer 6:
The correlated sub-query is punishing you, especially with the additional use of EXCEPT.
To paraphrase your query, you're only interested in a work_unit_id when a specified worker has ALL of that work_unit's skills? (If a work_unit has a skill associated with it, but the specified user doesn't have that skill, exclude that work_unit?)
This can be achieved with JOINs and GROUP BY, with no need for correlation at all.
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
INNER JOIN
(
SELECT
wus.work_unit_id
FROM
work_units_skills wus
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
GROUP BY
wus.work_unit_id
HAVING
COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
applicable_work_units
ON applicable_work_units.work_unit_id = work_units.id
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
The sub-query compares a worker's skill set to each work unit's skill set. If there are any skills the work unit has that the worker doesn't, then ws.skill_id will be NULL for that row, and as NULL is ignored by COUNT(), this means that COUNT(ws.skill_id) will be lower than COUNT(wus.skill_id), and so that work_unit is excluded from the sub-query's results.
This assumes that the workers_skills table is unique over (worker_id, skill_id) and that the work_units_skills table is unique over (work_unit_id, skill_id). If that's not the case then you may want to tinker with the HAVING clause (such as COUNT(DISTINCT wus.skill_id), etc).
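The two counts can be inspected side by side. This is only a sketch: the rows are invented, and SQLite (Python's standard sqlite3 module) stands in for PostgreSQL, relying on the same rule that COUNT(column) skips NULLs.

```python
import sqlite3

# Hypothetical sample data; the UNIQUE constraints mirror the assumption above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER,
                                    UNIQUE (work_unit_id, skill_id));
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER,
                                 UNIQUE (worker_id, skill_id));
    -- unit 1 needs skills 10 and 20, unit 2 needs 10 and 30
    INSERT INTO work_units_skills VALUES (1, 10), (1, 20), (2, 10), (2, 30);
    -- worker 1 has 10 and 20; worker 2 has 30 (must not leak into worker 1's match)
    INSERT INTO workers_skills VALUES (1, 10), (1, 20), (2, 30);
""")

rows = conn.execute("""
    SELECT wus.work_unit_id,
           COUNT(wus.skill_id) AS required,  -- every required skill
           COUNT(ws.skill_id)  AS matched    -- NULLs from the LEFT JOIN are skipped
    FROM work_units_skills wus
    LEFT JOIN workers_skills ws
           ON ws.skill_id = wus.skill_id
          AND ws.worker_id = 1
    GROUP BY wus.work_unit_id
    ORDER BY wus.work_unit_id
""").fetchall()

print(rows)  # [(1, 2, 2), (2, 2, 1)]

# Keeping only units where both counts agree is what the HAVING clause does.
applicable = [unit_id for unit_id, required, matched in rows if required == matched]
print(applicable)  # [1]
```

Unit 2 is dropped because skill 30 has no partner row for worker 1, so its matched count stays at 1.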
EDIT:
The above query assumes that only a relatively low number of work units would match the criteria for a specific worker.
If you assume that a relatively large number of work units would match, the opposite logic would be faster.
(Essentially, try to make the number of rows returned by the sub-query as low as possible.)
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
LEFT JOIN
(
SELECT
wus.work_unit_id
FROM
work_units_skills wus
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
WHERE
ws.skill_id IS NULL
GROUP BY
wus.work_unit_id
)
excluded_work_units
ON excluded_work_units.work_unit_id = work_units.id
WHERE
excluded_work_units.work_unit_id IS NULL
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
This one compares all work unit skills with those of the worker, and only keeps rows where the work unit has skills that the worker does not have.
Then, GROUP BY the work unit to get a list of work units that need to be ignored.
By LEFT joining these on to your existing results, you can stipulate you only want to include a work unit if it doesn't appear in the sub-query by specifying excluded_work_units.work_unit_id IS NULL.
Useful online guides will refer to anti-join and anti-semi-join.
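As a quick check of the anti-join shape, here is a sketch run from Python against SQLite (standing in for PostgreSQL). The rows and the units_for helper are invented for illustration; the SQL follows the excluded_work_units query with the other joins and conditions left out.

```python
import sqlite3

# Hypothetical sample data; unit 3 deliberately has no required skills.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units (id INTEGER PRIMARY KEY);
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER);
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER);

    INSERT INTO work_units VALUES (1), (2), (3);
    INSERT INTO work_units_skills VALUES (1, 10), (1, 20), (2, 10), (2, 30);
    INSERT INTO workers_skills VALUES (1, 10), (1, 20);
""")

ANTI_JOIN_QUERY = """
    SELECT wu.id
    FROM work_units wu
    LEFT JOIN (
        -- units with at least one skill the worker lacks
        SELECT wus.work_unit_id
        FROM work_units_skills wus
        LEFT JOIN workers_skills ws
               ON ws.skill_id = wus.skill_id
              AND ws.worker_id = ?
        WHERE ws.skill_id IS NULL
        GROUP BY wus.work_unit_id
    ) excluded_work_units
           ON excluded_work_units.work_unit_id = wu.id
    WHERE excluded_work_units.work_unit_id IS NULL  -- keep only non-excluded units
    ORDER BY wu.id
"""

def units_for(worker_id):
    """Return ids of work units that are not in the excluded set for this worker."""
    return [row[0] for row in conn.execute(ANTI_JOIN_QUERY, (worker_id,))]

print(units_for(1))  # [1, 3]
```

Only unit 2 lands in the excluded set for worker 1, so units 1 and 3 survive the IS NULL filter, including the unit that requires no skills.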
EDIT:
In general I would recommend against the use of a bit-mask.
Not because it's slow, but because it defies normalisation. The existence of a single field representing multiple items of data is a general sql-code-smell / sql-anti-pattern, as the data is no longer atomic. (This leads to pain down the road, especially if you reach a world where you have so many skills that they no longer all fit in to the data type chosen for the bit-mask, or when it comes to managing frequent or complex changes to the skill sets.)
That said, if performance continues to be an issue, de-normalisation is often a very useful option. I'd recommend keeping the bit masks in separate tables to make it clear that they're de-normalised / cached calculation results. In general though, such options should be a last resort rather than a first reaction.
EDIT: Example revisions to always include work_units that have no skills...
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
INNER JOIN
(
SELECT
w.id AS work_unit_id
FROM
work_units w
LEFT JOIN
work_units_skills wus
ON wus.work_unit_id = w.id
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
GROUP BY
w.id
HAVING
COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
applicable_work_units
ON applicable_work_units.work_unit_id = work_units.id
The excluded_work_units version of the code (the second example query above) should work without need for modification for this corner case (and is the one I'd initially trial for live performance metrics).
Answer 7:
You can get the work units covered by a worker's skills in an aggregation, as has been shown already. You'd typically use IN on this set of work units then.
SELECT wu.*
FROM work_units wu
-- some joins
WHERE wu.id IN
(
SELECT wus.work_unit_id
FROM work_units_skills wus
LEFT JOIN workers_skills ws ON ws.skill_id = wus.skill_id AND ws.worker_id = 1
GROUP BY wus.work_unit_id
HAVING COUNT(*) = COUNT(ws.skill_id)
)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;
When it comes to speeding up queries, the main part is often to provide the appropriate indexes, though. (With a perfect optimizer, re-writing a query to get the same result would have no effect at all, because the optimizer would get to the same execution plan.)
You want the following indexes (order of the columns matters):
create index idx_ws on workers_skills (worker_id, skill_id);
create index idx_wus on work_units_skills (skill_id, work_unit_id);
(Read it like this: We come with a worker_id, get the skill_ids for the worker, join work units on these skill_ids and thus get the work_unit_ids.)
Answer 8:
Might not apply to you, but I had a similar issue that I solved by simply merging the main and sub into the same column, using numbers for main and letters for sub.
Btw, are all columns involved in the joins indexed? My server goes from a 2-3 sec query on 500k+ tables to a crash on 10k tables if I forget.
Answer 9:
With Postgres, relational division can often be expressed more efficiently using arrays.
In your case I think the following will do what you want:
select *
from work_units
where id in (select work_unit_id
from work_units_skills
group by work_unit_id
having array_agg(skill_id) <@ array(select skill_id
from workers_skills
where worker_id = 6))
and ... other conditions here ...
order by ...
array_agg(skill_id) collects all skill_ids for each work_unit and compares that with the skills of a specific worker using the <@ operator ("is contained by"). That condition returns all work_unit_ids where the list of skill_ids is contained in the skills for a single worker.
In my experience this approach is usually faster than equivalent exists or intersect solutions.
Online example: http://rextester.com/WUPA82849
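The containment test itself can be mimicked with plain sets. This is a rough Python analogue, not PostgreSQL: the tuples and the units_contained_in helper are made up, grouping into sets plays the role of array_agg, and the <= operator plays the role of <@.

```python
from collections import defaultdict

# Hypothetical rows: (work_unit_id, skill_id) and (worker_id, skill_id).
work_units_skills = [(1, 10), (1, 20), (2, 10), (2, 30)]
workers_skills = [(6, 10), (6, 20)]

def units_contained_in(worker_id):
    """Return work unit ids whose required skills are all held by the worker."""
    worker_set = {skill for worker, skill in workers_skills if worker == worker_id}
    required = defaultdict(set)
    for unit_id, skill_id in work_units_skills:
        required[unit_id].add(skill_id)  # analogous to array_agg per work unit
    # set <= set is the "is contained by" test
    return sorted(unit for unit, skills in required.items() if skills <= worker_set)

print(units_contained_in(6))  # [1]
```

One thing the analogue makes visible: only work units that have at least one row in work_units_skills are ever grouped, so a unit with no skills would not show up through this route and would presumably need separate handling.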
Source: https://stackoverflow.com/questions/47440855/how-to-efficiently-set-subtract-a-join-table-in-postgresql