How to efficiently set subtract a join table in PostgreSQL?

问题

I have the following tables:

work_units - self explanatory
workers - self explanatory
skills - every work unit requires a number of skills if you want to work on it. Every worker is proficient in a number of skills.
work_units_skills - join table
workers_skills - join table

A worker can request the next appropriate free highest priority (whatever that means) unit of work to be assigned to her.

Currently I have:

SELECT work_units.*
FROM work_units
-- some joins
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills
        WHERE work_unit_id = work_units.id

        EXCEPT

        SELECT skill_id
        FROM workers_skills
        WHERE worker_id = 1 -- the worker id that made the request
      )
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

This condition makes the query 8-10 times slower though.

Is there a better way to express that a work_units's skills should be a subset of the workers's skills or something to improve the current query?

Some more context:

The skills table is fairly small.
Both work_units and workers tend to have very few associated skills.
work_units_skills has index on work_unit_id.
I tried moving the query on workers_skills into a CTE. This gave a slight improvement (10-15%), but it's still too slow.
A work unit with no skill can be picked up by any user. Aka an empty set is a subset of every set.

回答1:

One simple speed-up would be to use EXCEPT ALL instead of EXCEPT. The latter removes duplicates, which is unnecessary here and can be slow.

An alternative that would probably be faster is to use a further NOT EXISTS instead of the EXCEPT:

...
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills wus
        WHERE work_unit_id = work_units.id
        AND NOT EXISTS (
            SELECT skill_id
            FROM workers_skills ws
            WHERE worker_id = 1 -- the worker id that made the request
              AND ws.skill_id = wus.skill_id
        )
      )

Demo

http://rextester.com/AGEIS52439 - with the the LIMIT removed for testing

回答2:

(see UPDATE below)

This query finds a good work_unit using a simple LEFT JOIN to find a missing skill in the shorter table of skills the requesting worker has. The trick is whenever there is a missing skill, there will be a NULL value in the join and this is translated to a 1 and the work_unit is removed by leaving the ones with all 0 values i.e. having a max of 0.

Being classic SQL this would be the most heavily targeted query for optimization by the engine:

SELECT work_unit_id
FROM
  work_units_skills s
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY work_unit_id
HAVING max(CASE WHEN t.skill_id IS NULL THEN 1 ELSE 0 END)=0
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

UPDATE

In order to catch work_units with no skills, we throw the work_units table into the JOIN:

SELECT r.id AS work_unit_id
FROM
  work_units r
LEFT JOIN
  work_units_skills s ON (r.id=s.work_unit_id)
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY r.id
HAVING bool_or(s.skill_id IS NULL) OR bool_and(t.skill_id IS NOT NULL)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

回答3:

You may use the following query

SELECT wu.*
FROM work_units wu
LEFT JOIN work_units_skills wus ON wus.work_unit_id = wu.id and wus.skill_id IN (
    SELECT id
    FROM skills
    EXCEPT
    SELECT skill_id
    FROM workers_skills
    WHERE worker_id = 1 -- the worker id that made the request
)
WHERE wus.work_unit_id IS NULL;

demo (thanks, Steve Chambers for most of the data)

You should definitely have index on work_units_skills(skill_id), workers_skills(worker_id) and work_units(id). If you want to speed it up, even more, create indexes work_units_skills(skill_id, work_unit_id) and workers_skills(worker_id, skill_id) which avoid accessing those tables.

The subquery is independent and outer join should relatively fast if the result is not large.

回答4:

Bit-Mask Solution
Without any changes in your previous Database Design, just add 2 fields.
First: a long or bigint (related to your DBMS) into Workers
Second: another long or bigint into Work_Units

These fields show skills of work_units and skills of workers. For example suppose that you have 8 records in Skills table. (notice that records of skill in small)
1- some skill 1
2- some skill 2
...
8- some skill 8

then if we want to set skills 1,3,6,7 to one work_unit, just use this number 01100101.
(I offer to use reversed version of binary 0,1 placement to support additional skills in future.)

In practice you can use 10 base number to add in database (101 instead of 01100101)

Similar number can be generated to workers. Any worker choose some skills. So we can turn the selected items into a number and save it in additional field in Worker table.

Finally, to find appropriate work_units subset for any worker JUST select from work_units and use bitwise AND like below.
A: new_field_of_specific_worker (shows Skills of each Worker) that we are searching works_units related to him/her right now.
B: new_field_of_work_units that shows Skills of each work_unit

select * from work_units
where A & B  = B

Notice:
1: absolutely, this is fastest way but it has some difficulties.
2: we have some extra difficulties when a new skill is Added or to be Delete. But this is a trade-off. Adding or Deleting new skills less happens.
3: we should use skills and work_unit_skills and workers_skills too. But in search, we just use new fields

Also, this approach can be used for TAG Management systems like Stack Overflow TAGs.

回答5:

With the current info I can only answer on a hunch. Try removing the EXCEPT-statement and see if it gets significantly faster. If it does, you can add that part again, but using WHERE-conditions. In my experience set operators (MINUS/EXCEPT, UNION, INTERSECT) are quite the performance killers.

回答6:

The correlated sub-query is punishing you, especially with the additional use of EXCEPT.

To paraphrase your query, you're only interested in a work_unit_id when a specified worker has ALL of that work_unit's skills? (If a work_unit has a skill associated with it, but the specified user doesn't have that skill, exclude that work_unit?)

This can be achieve with JOINs and GROUP BY, and no need for correlation at all.

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_unit_skills   wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        wus.work_unit_id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1

The sub-query compares a worker's skill set to each work unit's skill set. If there are any skills the work unit has that the worker doesn't then ws.skill_id will be NULL for that row, and as NULL is ignored by COUNT() this means that COUNT(ws.skill_id) will then be lower than COUNT(wus.skill_id), and so that work_unit would become excluded from the sub-query's results.

This assumes that the workers_skills table is unique over (work_id, skill_id) and that the work_unit_skills table is unique over (work_unit_id, skill_id). If that's not the case then you may want to tinker with the HAVING clause (such as COUNT(DISTINT wus.skill_id), etc).

EDIT:

The above query assumes that only relatively low number of work units would match the criteria of matching a specific worker.

If you assume that a relatively large number of work units would match, the opposite logic would be faster.

(Essentially, try to make the number of rows returns by the sub-query as low as possible.)

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
LEFT JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_unit_skills   wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    WHERE
        ws.skill_id IS NULL
    GROUP BY
        wus.work_unit_id
)
     excluded_work_units
         ON  excluded_work_units.work_unit_id = work_units.id
WHERE
    excluded_work_units.work_unit_id IS NULL
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1

This one compares all work unit skills with those of the worker, and only keeps rows where the work unit has skills that the worker does not have.

Then, GROUP BY the work unit to get a list of work units that need to be ignored.

By LEFT joining these on to your existing results, you can stipulate you only want to include a work unit if it doesn't appear in the sub-query by specifying excluded_work_units.work_unit_id IS NULL.

Useful online guides will refer to anti-join and anti-semi-join.

EDIT:

In general I would recommend against the use of a bit-mask.

Not because it's slow, but because it defies normalisation. The existence of a single field representing multiple items of data is a general sql-code-smell / sql-anti-pattern, as the data is no longer atomic. (This leads to pain down the road, especially if you reach a world where you have so many skills that they no longer all fit in to the data type chosen for the bit-mask, or when it comes to managing frequent or complex changes to the skill sets.)

That said, if performance continues to be an issue, de-normalisation is often a very useful option. I'd recommend keeping the bit masks in separate tables to make it clear that they're de-normalised / cached calcualtion results. In general though, such options should be a last resort rather than a first reaction.

EDIT: Example revisions to always include work_units that have no skills...

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        w.id   AS work_unit_id
    FROM
        work_units          w
    LEFT JOIN
        work_units_skills   wus
            ON wus.work_unit_id = w.id
    LEFT JOIN
        workers_skills      ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        w.id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id

The excluded_work_units version of the code (the second example query above) should work without need for modification for this corner case (and is the one I'd initially trial for live performance metrics).

回答7:

You can get the work units covered by a worker's skills in an aggregation, as has been shown already. You'd typically use IN on this set of work units then.

SELECT wu.*
FROM work_units wu
-- some joins
WHERE wu.id IN
(
  SELECT wus.work_unit_id
  FROM work_units_skills wus
  LEFT JOIN workers_skills ws ON ws.skill_id = wus.skill_id AND ws.worker_id = 1
  GROUP BY wus.work_unit_id
  HAVING COUNT(*) = COUNT(ws.skill_id)
)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

When it comes to speeding up queries, the main part is often to provide the appropriate indexes, though. (With a perfect optimizer, re-writing a query to get the same result would have no effect at all, because the optimizer would get to the same execution plan.)

You want the following indexes (order of the columns matters):

create index idx_ws on workers_skills (worker_id, skill_id);
create index idx_wus on work_units_skills (skill_id, work_unit_id);

(Read it like this: We come with a worker_id, get the skill_ids for the worker, join work units on these skill_ids and get thus the work_unit_ids.)

回答8:

Might not apply to you, but I had a similar issue that I solved simply merging the main and sub into the same column using numbers for main and letters for sub.

Btw, are all columns involved in the joins indexed? My server goes from 2-3 sec query on 500k+ tables to crash on 10k tables if I forget

回答9:

With Postgres, relational division can often be expressed more efficiently using arrays.

In your case I think the following will do what you want:

select *
from work_units
where id in (select work_unit_id
             from work_units_skills
             group by work_unit_id
             having array_agg(skill_id) <@ array(select skill_id 
                                                 from workers_skills 
                                                 where worker_id = 6))
and ... other conditions here ...
order by ...

array_agg(skill_id) collects all skill_ids for each work_unit and compares that with the skills of a specific worker using the <@ operator ("is contained by"). That condition returns all work_unit_ids where the list of skill_ids is contained in the skills for a single worker.

In my experience this approach is usually faster then equivalent exists or intersect solutions.

Online example: http://rextester.com/WUPA82849

来源：https://stackoverflow.com/questions/47440855/how-to-efficiently-set-subtract-a-join-table-in-postgresql

标签

sql

postgresql

performance

relational-division

set-operations