How to get unique values from each column based on a condition?

I have been trying to find an optimal way to select the unique values from each column. My problem is that I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, which I can do with the query below:

select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%'

Sample output for column names:

c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...

Then I would use the returned column names to get the unique/distinct values in each column, not just distinct rows.

I know the simplest, if lousy, way is to write select distinct column_name from table where column_name = 'something' for every single column (around 20-50 times), which is very time-consuming. Since I can't use more than one distinct per column_name, I am stuck with this old-school solution.

I am sure there is a faster, more elegant way to achieve this, but I just couldn't figure out how. I would really appreciate any help on this.

Erwin Brandstetter

You can't just return rows, since distinct values don't go together any more.

You could return arrays, which is simpler than you may have expected:

SELECT array_agg(DISTINCT c1)  AS c1_arr
      ,array_agg(DISTINCT c2a) AS c2a_arr
      ,array_agg(DISTINCT c2b) AS c2b_arr
      , ...
FROM   m0301010000_ds;

This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
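As a small illustration of the shape of the result, here is a minimal sketch using a throwaway table. The table name demo_ds and its contents are made up for this example and are not from the question:

```sql
-- Hypothetical demo table; the name and the data are for illustration only.
CREATE TEMP TABLE demo_ds (c1 text, c2a text);
INSERT INTO demo_ds VALUES ('a', 'x'), ('a', 'y'), ('b', 'x');

-- Returns a single row: one array of distinct values per column.
-- Each array here holds two elements, although the table has three rows.
SELECT array_agg(DISTINCT c1)  AS c1_arr
      ,array_agg(DISTINCT c2a) AS c2a_arr
FROM   demo_ds;
```

Note that array_agg() does not skip NULL, so a column containing NULL values gets one NULL element in its array.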

Build SQL automatically

CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
  RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %I) AS %I'
                                     , attname, attname || '_arr')  -- quote the alias as a whole
                              , E'\n      ,' ORDER  BY attnum)
        || E'\nFROM   ' || _tbl
FROM   pg_attribute
WHERE  attrelid = _tbl            -- valid, visible table name 
AND    attnum >= 1                -- exclude tableoid & friends
AND    NOT attisdropped           -- exclude dropped columns
$func$  LANGUAGE sql;

Call:

SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');

Returns an SQL string as displayed above.
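If you would rather get the values directly instead of an SQL string to run by hand, one option is dynamic SQL in PL/pgSQL. The following is only a sketch under assumptions: the function name f_dist_vals and its return layout are my own invention, and every value is cast to text so that a single return type fits columns of any data type.

```sql
-- Hypothetical helper, not part of the original answer.
-- Returns one row per column: the column name and its distinct values as text[].
CREATE OR REPLACE FUNCTION f_dist_vals(_tbl regclass)
  RETURNS TABLE (col_name text, dist_vals text[]) AS
$func$
DECLARE
   _col name;
BEGIN
   FOR _col IN
      SELECT attname
      FROM   pg_attribute
      WHERE  attrelid = _tbl       -- valid, visible table name
      AND    attnum >= 1           -- exclude system columns
      AND    NOT attisdropped      -- exclude dropped columns
      ORDER  BY attnum
   LOOP
      -- %L quotes the column name as a literal, %I as an identifier;
      -- %s inserts the regclass value, which is already safely quoted.
      RETURN QUERY EXECUTE format(
         'SELECT %L::text, array_agg(DISTINCT %I::text) FROM %s'
        , _col, _col, _tbl);
   END LOOP;
END
$func$  LANGUAGE plpgsql;

SELECT * FROM f_dist_vals('public.m0301010000_ds');
```

Unlike the generated single SELECT, this sketch scans the table once per column, so it trades speed for a fixed, easy-to-consume result type.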

I use the system catalog pg_attribute instead of the information schema, and the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table

If you need this in "real time", you won't be able to achieve it with SQL that has to do a full table scan.

I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from @Erwin Brandstetter ;) and maintain it using a trigger on the original table.

Your new table will have one column per field. The number of rows will equal the maximum number of distinct values for any one field.

On insert: for each field to maintain, check whether the value is already there. If not, add it.

On update: for each field to maintain whose old value differs from the new value, check whether the new value is already there. If not, add it. For the old value, check whether any other row still has it; if not, remove it from the list (set the field to null).

On delete: for each field to maintain, check whether any other row still has the value; if not, remove it from the list (set the field to null).

This way the load is mainly moved to the trigger, and queries on the value list table will be super fast.

P.S.: Make sure to run all the SQL from your trigger through EXPLAIN to make sure it uses the best possible index and execution plan. For update/delete, just check whether the old value exists (limit 1).
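The insert/update/delete rules above could be sketched roughly as follows for a single column, c1. Be aware of the assumptions: this uses a simplified narrow lookup table (one row per column name and value) rather than the one-column-per-field layout described above, the names m0301010000_ds_vals, trg_maintain_c1_vals and maintain_c1_vals are hypothetical, and values are stored as text. It also ignores races between concurrent transactions, so treat it as a starting point only.

```sql
-- Hypothetical lookup table: one row per (column, distinct value).
CREATE TABLE m0301010000_ds_vals (
   col_name text NOT NULL
 , val      text NOT NULL
 , PRIMARY KEY (col_name, val)
);

CREATE OR REPLACE FUNCTION trg_maintain_c1_vals()
  RETURNS trigger AS
$func$
BEGIN
   -- INSERT / UPDATE: add the new value if it is not listed yet.
   IF TG_OP IN ('INSERT', 'UPDATE') THEN
      INSERT INTO m0301010000_ds_vals (col_name, val)
      SELECT 'c1', NEW.c1::text
      WHERE  NEW.c1 IS NOT NULL
      AND    NOT EXISTS (
         SELECT 1 FROM m0301010000_ds_vals
         WHERE  col_name = 'c1' AND val = NEW.c1::text);
   END IF;

   -- UPDATE / DELETE: drop the old value if no row has it any more.
   -- The EXISTS check is the part that needs an index on c1 to stay fast.
   IF TG_OP IN ('UPDATE', 'DELETE') THEN
      DELETE FROM m0301010000_ds_vals
      WHERE  col_name = 'c1'
      AND    val = OLD.c1::text
      AND    NOT EXISTS (
         SELECT 1 FROM m0301010000_ds WHERE c1 = OLD.c1);
   END IF;

   RETURN NULL;  -- return value is ignored for AFTER row triggers
END
$func$  LANGUAGE plpgsql;

CREATE TRIGGER maintain_c1_vals
AFTER INSERT OR UPDATE OR DELETE ON m0301010000_ds
FOR EACH ROW EXECUTE PROCEDURE trg_maintain_c1_vals();
```

Reading the distinct values for c1 then becomes a cheap lookup: SELECT val FROM m0301010000_ds_vals WHERE col_name = 'c1'. Covering more columns means repeating the two blocks per column, or generating the trigger body from pg_attribute in the same way as the query builder above.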

Source: https://stackoverflow.com/questions/23745666/how-to-get-unique-values-from-each-column-based-on-a-condition

Tags

sql

postgresql

postgresql-9.1

postgresql-performance