Question
SELECT commandid
FROM results
WHERE NOT EXISTS (
SELECT *
FROM generate_series(0,119999)
WHERE generate_series = results.commandid
);
I have a column, commandid, in results of type int, but various tests failed and were not added to the table. I would like to create a query that returns a list of commandid values that are not found in results. I thought the above query would do what I wanted. However, it does not even work if I use a range that is outside the expected possible range of commandid (like negative numbers).
Answer 1:
Given sample data:
create table results ( commandid integer primary key);
insert into results (commandid) select * from generate_series(1,1000);
delete from results where random() < 0.20;
This works:
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE NOT EXISTS (SELECT 1 FROM results WHERE commandid = s.i);
as does this alternative formulation:
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
LEFT OUTER JOIN results ON (results.commandid = s.i)
WHERE results.commandid IS NULL;
Both of the above appear to result in identical query plans in my tests, but you should compare them against your own data on your own database using EXPLAIN ANALYZE to see which is best.
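To make that comparison concrete, here is a minimal sketch that runs both formulations under EXPLAIN ANALYZE against the sample table above. Keep in mind that EXPLAIN ANALYZE actually executes the statement, and that the plans and timings depend on your PostgreSQL version and data.

```sql
-- Plan and timing for the NOT EXISTS formulation
EXPLAIN ANALYZE
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE NOT EXISTS (SELECT 1 FROM results WHERE commandid = s.i);

-- Plan and timing for the LEFT OUTER JOIN formulation
EXPLAIN ANALYZE
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
LEFT OUTER JOIN results ON (results.commandid = s.i)
WHERE results.commandid IS NULL;
```

If both outputs show the same anti-join node (for example a Hash Anti Join), the planner is treating the two queries the same way, and the choice comes down to readability.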
Explanation
Note that instead of NOT IN I've used NOT EXISTS with a subquery in one formulation, and an ordinary OUTER JOIN in the other. These are much easier for the DB server to optimise, and they avoid the confusing issues that can arise with NULLs in NOT IN.
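The NULL issue is easiest to see on a tiny inline data set. The sketch below does not touch results (whose primary key rules out NULLs in the sample data); the alias v and the column x are made up for illustration.

```sql
-- NOT IN against a set that contains a NULL: returns no rows at all,
-- because "2 = NULL" is unknown, so "2 NOT IN (1, NULL)" is unknown too.
SELECT s.i
FROM generate_series(1,3) s(i)
WHERE s.i NOT IN (SELECT x FROM (VALUES (1), (NULL)) v(x));

-- NOT EXISTS against the same set: returns 2 and 3,
-- because the NULL row simply never matches.
SELECT s.i
FROM generate_series(1,3) s(i)
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) v(x) WHERE v.x = s.i
);
```

A single NULL in the subquery is enough to make NOT IN return nothing, which is why it is a risky choice for a nullable column.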
I initially favoured the OUTER JOIN formulation, but at least in 9.1 with my test data the NOT EXISTS form optimizes to the same plan.
Both will perform better than the NOT IN formulation below when the series is large, as in your case. NOT IN used to require Pg to do a linear search of the IN list for every tuple being tested, but examination of the query plan suggests Pg may be smart enough to hash it now. The NOT EXISTS (transformed into a JOIN by the query planner) and the JOIN work better.
The NOT IN formulation is both confusing in the presence of NULL commandids and can be inefficient:
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE s.i NOT IN (SELECT commandid FROM results);
so I'd avoid it. With 1,000,000 rows the other two completed in 1.2 seconds, while the NOT IN formulation ran CPU-bound until I got bored and cancelled it.
Answer 2:
As I mentioned in the comment, you need to do the reverse of the query in the question: select from the generated series and keep the values that do not appear in results.
SELECT
generate_series
FROM
generate_series(0, 119999)
WHERE
NOT generate_series IN (SELECT commandid FROM results);
This returns the values in the selected range that do not exist in the commandid column.
Answer 3:
I am not an experienced SQL guru, but I like to look at other ways to solve a problem. Just today I had a similar problem: finding unused numbers in a character column. I solved it with PL/pgSQL and was curious how fast my procedure would be. I used @Craig Ringer's approach to generate a table with a serial column, added one million records, and then deleted every 99th record. The procedure takes about 3 seconds to find the missing numbers:
-- creating table
create table results (commandid character(7) primary key);
-- populating table with serial numbers formatted as characters
insert into results (commandid) select cast(num_id as character(7)) from generate_series(1,1000000) as num_id;
-- delete some records
delete from results where cast(commandid as integer) % 99 = 0;
create or replace function unused_numbers()
returns setof integer as
$body$
declare
    i integer;  -- next number we expect to see
    r record;
begin
    -- Loop through the table in order, keeping a synchronized counter.
    i := 1;
    for r in
        (select distinct cast(commandid as integer) as num_value
         from results
         order by num_value asc)
    loop
        if not (i = r.num_value) then
            -- Gap found: emit every number up to the next stored value.
            while true loop
                return next i;
                i := i + 1;
                if (i = r.num_value) then
                    i := i + 1;
                    exit;
                else
                    continue;
                end if;
            end loop;
        else
            i := i + 1;
        end if;
    end loop;
    return;
end;
$body$
language plpgsql volatile
cost 100
rows 1000;
select * from unused_numbers();
Maybe it will be useful to someone.
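As a cross-check on the function's output, the same gaps can be computed with a set-based query over the same character-typed results table. This is only a sketch: it assumes, like the function, that numbering starts at 1, and it stops at the largest stored value.

```sql
-- Set-based cross-check for unused_numbers(), using the same results table.
SELECT s.i AS missing_cmd
FROM generate_series(
         1,
         (SELECT max(cast(commandid AS integer)) FROM results)
     ) AS s(i)
WHERE NOT EXISTS (
    SELECT 1
    FROM results
    WHERE cast(commandid AS integer) = s.i
)
ORDER BY s.i;
```

With the sample data above, both approaches should list the multiples of 99. Because the comparison casts commandid, the planner most likely cannot use the primary key index here, so timings may differ from those for an integer column.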
Answer 4:
If you're on AWS Redshift, you might end up needing to work around the question's approach, since Redshift doesn't support generate_series. You'll end up with something like this:
select
startpoints.id gapstart,
min(endpoints.id) resume
from (
select id+1 id
from yourtable outer_series
where not exists
(select null
from yourtable inner_series
where inner_series.id = outer_series.id + 1
)
order by id
) startpoints,
yourtable endpoints
where
endpoints.id > startpoints.id
group by
startpoints.id;
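To see what the query above reports, here is a small worked example. The sample rows are invented for illustration; gapstart is the first missing id of each gap and resume is the next id that exists.

```sql
-- Hypothetical sample data: ids 4, 5, 8 and 9 are missing.
create table yourtable (id integer);
insert into yourtable (id) values (1), (2), (3), (6), (7), (10);

-- Same query as above, with an outer ORDER BY added for stable output.
select
    startpoints.id gapstart,
    min(endpoints.id) resume
from (
    select id+1 id
    from yourtable outer_series
    where not exists
        (select null
         from yourtable inner_series
         where inner_series.id = outer_series.id + 1
        )
    order by id
) startpoints,
yourtable endpoints
where
    endpoints.id > startpoints.id
group by
    startpoints.id
order by
    gapstart;

-- Result:
--  gapstart | resume
--         4 |      6
--         8 |     10
```

The candidate start point 11 (one past the largest id) has no later row to join to, so it drops out of the result. Each row describes a whole gap rather than listing every missing number.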
Source: https://stackoverflow.com/questions/12444142/postgresql-how-to-figure-out-missing-numbers-in-a-column-using-generate-series