Get columns that differ between 2 rows

Question

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.

Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.

I could compare the two rows column by column, 60 times over, but I am looking for a simpler and more generic solution.

Something like:

SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33

The result should be the column names that differ.

Answer 1:

You can use the hstore extension for this. It often comes in handy when you need to iterate over columns.

The trick is to convert each row into an hstore value made of column_name=>value pairs, and then use the hstore functions and operators to compute the differences.
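To make the idea concrete before the demo, here is a minimal sketch that uses two hand-written hstore literals in place of real rows. The literals and the differing_pairs alias are illustrative assumptions, not part of the original answer; the first statement enables the extension, which has to be done once per database and needs sufficient privileges.

```sql
-- One-time setup per database (assumes the hstore contrib module is available).
CREATE EXTENSION IF NOT EXISTS hstore;

-- Two hand-written hstore values standing in for two rows.
-- hstore - hstore removes from the left operand every pair that
-- also appears, with the same value, in the right operand.
SELECT 'id=>1, t1=>foo, t3=>baz'::hstore
     - 'id=>2, t1=>foo, t3=>biz'::hstore AS differing_pairs;
```

Only the id and t3 pairs of the left-hand value should survive, because t1 holds the same value on both sides.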

Demo:

CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);

Let's insert two rows that differ by the primary key and one other column (t3).

INSERT INTO table1 VALUES
 (1,'foo','bar','baz'),
 (2,'foo','bar','biz');

The query:

SELECT skeys(h1-h2) from 
  (select hstore(t.*) as h1 from table1 t where id=1) h1
 CROSS JOIN
  (select hstore(t.*) as h2 from table1 t where id=2) h2;

h1-h2 removes from h1 every pair that also appears in h2 with the same value, so only the pairs that differ remain. skeys() then outputs the keys of those pairs as a set.

Result:

 skeys 
-------
 id
 t3

The select list can be refined to skeys((h1-h2)-'id') so that id is always removed; as the primary key, it necessarily differs between rows.
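For a duplicate-review tool it may also help to see the two competing values next to each differing column name. The following is a sketch along those lines, assuming the hstore extension is installed and table1 from the demo exists; the aliases col, value_in_row_1 and value_in_row_2 are made up for illustration. It keeps skeys() in the select list of a subquery so that it does not rely on LATERAL, which older releases such as 9.1 lack.

```sql
-- List each differing column together with its value in both rows.
SELECT col        AS column_name,
       h1 -> col  AS value_in_row_1,
       h2 -> col  AS value_in_row_2
FROM (
    -- Drop the primary key, then expand the remaining differing keys.
    SELECT skeys((h1 - h2) - 'id'::text) AS col, h1, h2
    FROM (SELECT hstore(t.*) AS h1 FROM table1 t WHERE id = 1) a
    CROSS JOIN
         (SELECT hstore(t.*) AS h2 FROM table1 t WHERE id = 2) b
) AS diff;
```

With the demo rows this should report t3 only, with baz and biz as the two values. Note that hstore stores every value as text, so the comparison is made on the text representation of each column.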

Answer 2:

Here's a stored procedure that should get you most of the way...

While this should work "as is", it has no error checking, which you should add.

It gets all the columns in the table and loops over them. A column counts as different when the two rows hold more than one distinct value in it. The output is:

The count of the number of differences
Messages for each column where there is a difference

It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!

Usage:

SELECT showdifference('public','company','co_id',22,33);


CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
  RETURNS INTEGER AS
$BODY$ 
DECLARE
    l_diffcount INTEGER;
    l_column text;
    l_dupcount integer;
    column_cursor CURSOR FOR
        select column_name
        from information_schema.columns
        where table_name = p_tablename
          and table_schema = p_schema
          and column_name <> p_idcolumn;
BEGIN


    -- need error checking here, to ensure the table and schema exist and the columns exist

    -- Should also check that the records ids exist.

    -- Should also check that the column type of the id field is integer


    -- Set the number of differences to zero.

    l_diffcount := 0;

    -- use a cursor to iterate over the columns found in information_schema.columns
    -- open the cursor

    OPEN column_cursor;

    LOOP
        FETCH column_cursor INTO l_column;
        EXIT WHEN NOT FOUND;

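        -- build a query that counts the distinct values of this column across the two rows.
        -- count(distinct col) would ignore NULLs, so NULL vs. non-NULL would go unnoticed;
        -- counting the rows of a DISTINCT subquery treats NULL as a value of its own.
        EXECUTE 'select count(*) from (select distinct ' || quote_ident(l_column)
             || ' from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename)
             || ' where ' || quote_ident(p_idcolumn) || ' in (' || p_firstid || ',' || p_secondid || ')) s'
        INTO l_dupcount;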



        IF l_dupcount > 1 THEN
            -- increment the counter
            l_diffcount := l_diffcount + 1;
            -- for "real" you might want to return a rowset and could do something here
            RAISE NOTICE '% differs (% distinct values)', l_column, l_dupcount;
        END IF;


    END LOOP;




    -- close the cursor
    CLOSE column_cursor;


    RETURN l_diffcount;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT
  COST 100;
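
As the answer notes, returning a rowset of the differing columns may be more useful than notices. Below is a rough sketch of that variant. The function name showdifference_rows is hypothetical, the parameters mirror the original, and it still lacks the error checking mentioned above. It compares the two rows with IS DISTINCT FROM, which treats NULL as a comparable value, and it assumes every column type has an equality operator (types such as json do not, and would make the dynamic query fail).

```sql
-- Hypothetical variant: returns one row per differing column name.
CREATE OR REPLACE FUNCTION showdifference_rows(
    p_schema text, p_tablename text, p_idcolumn text,
    p_firstid integer, p_secondid integer)
  RETURNS SETOF text AS
$BODY$
DECLARE
    l_column  text;
    l_differs boolean;
BEGIN
    FOR l_column IN
        SELECT column_name::text
        FROM information_schema.columns
        WHERE table_schema = p_schema
          AND table_name = p_tablename
          AND column_name <> p_idcolumn
        ORDER BY ordinal_position
    LOOP
        -- Self-join the table to put both rows side by side and compare one column.
        EXECUTE 'SELECT a.' || quote_ident(l_column)
             || ' IS DISTINCT FROM b.' || quote_ident(l_column)
             || ' FROM ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' a, '
             || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' b'
             || ' WHERE a.' || quote_ident(p_idcolumn) || ' = $1'
             || ' AND b.' || quote_ident(p_idcolumn) || ' = $2'
        INTO l_differs
        USING p_firstid, p_secondid;

        -- l_differs stays NULL if either row is missing, so nothing is returned then.
        IF l_differs THEN
            RETURN NEXT l_column;
        END IF;
    END LOOP;
    RETURN;
END;
$BODY$
  LANGUAGE plpgsql STABLE STRICT;
```

It could then be called like the original, for example: SELECT * FROM showdifference_rows('public','company','co_id',22,33);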

Source: https://stackoverflow.com/questions/28630354/get-columns-that-differ-between-2-rows

Tags

postgresql

postgresql-9.1

duplicate-removal