How do I (or can I) SELECT DISTINCT on multiple columns?

梦谈多话 2020-11-21 23:53

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day f

5 answers
  • 2020-11-22 00:04

    The problem with your query is that with a GROUP BY clause (which is essentially what DISTINCT does) you can only select columns that you group by, or aggregate functions. You cannot select the column id, because a group may contain several different values for it. In your case each group always has exactly one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.

    This should work however (and doesn't need a join):

    UPDATE sales
    SET status='ACTIVE'
    WHERE id IN (
      SELECT MIN(id) FROM sales
      GROUP BY saleprice, saledate
      HAVING COUNT(id) = 1
    )
    

    You could also use MAX or AVG instead of MIN. What matters is using an aggregate function that returns the column's own value when there is only one matching row.
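
As a rough check of the point about MIN, MAX and AVG, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table layout and sample rows are invented for illustration, and SQLite stands in for whatever RDBMS you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT, status TEXT)"
)
# Invented sample data: ids 1 and 2 share the same price and day.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01"),
        (2, 100, "2020-01-01"),
        (3, 100, "2020-01-02"),
        (4, 250, "2020-01-02"),
    ],
)


def unique_ids(aggregate):
    # `aggregate` is one of the fixed names used below, never user input.
    sql = f"""
        SELECT {aggregate}(id) FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(id) = 1
    """
    return sorted(row[0] for row in conn.execute(sql))


ids_min = unique_ids("MIN")
ids_max = unique_ids("MAX")
ids_avg = unique_ids("AVG")  # SQLite returns AVG as a float
print(ids_min, ids_max, ids_avg)
```

All three aggregates pick out the same ids because every surviving group holds exactly one row.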

  • 2020-11-22 00:05
    SELECT DISTINCT a,b,c FROM t
    

    is roughly equivalent to:

    SELECT a,b,c FROM t GROUP BY a,b,c
    

    It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
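
To see the equivalence, and one thing GROUP BY adds, here is a small sketch with Python's sqlite3 module. The table t and its rows are made up for the example, and the comparison uses sets because neither query guarantees an output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
# Invented rows, including a duplicated row that contains NULL.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 1), (2, 2, None), (2, 2, None)],
)

distinct_rows = conn.execute("SELECT DISTINCT a, b, c FROM t").fetchall()
grouped_rows = conn.execute("SELECT a, b, c FROM t GROUP BY a, b, c").fetchall()

# GROUP BY can also report something about each group, here its size.
group_sizes = {
    row[:3]: row[3]
    for row in conn.execute("SELECT a, b, c, COUNT(*) FROM t GROUP BY a, b, c")
}

print(set(distinct_rows) == set(grouped_rows))
print(group_sizes)
```

The per-group count in the last query is the kind of extra information that plain DISTINCT cannot return.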

    For your query, I'd do it like this:

    UPDATE sales
    SET status='ACTIVE'
    WHERE id IN
    (
        SELECT id
        FROM sales S
        INNER JOIN
        (
            SELECT saleprice, saledate
            FROM sales
            GROUP BY saleprice, saledate
            HAVING COUNT(*) = 1 
        ) T
        ON S.saleprice=T.saleprice AND S.saledate=T.saledate
     )
    
  • 2020-11-22 00:06

    If your DBMS doesn't support distinct with multiple columns like this:

    select distinct(col1, col2) from table
    

    then a multi-column DISTINCT can generally be written safely as follows:

    select distinct * from (select col1, col2 from table ) as x
    

    This should work on most DBMS, and I would expect it to be faster than the GROUP BY solution because it avoids the grouping step, though that is worth measuring on your own system.
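
A quick way to try the derived-table form is the sketch below, which uses Python's sqlite3 module. The table name tbl (chosen because table is a reserved word), the extra column col3 and the rows are all assumptions made for the example. It checks only the result, not the speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 TEXT, col3 TEXT)")
# Invented rows: col3 differs everywhere, (col1, col2) repeats once.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, "a", "x"), (1, "a", "y"), (1, "b", "z"), (2, "a", "w")],
)

# DISTINCT applied to a derived table that exposes only the two columns.
wrapped = conn.execute(
    "SELECT DISTINCT * FROM (SELECT col1, col2 FROM tbl) AS x"
).fetchall()

# The direct multi-column form, for comparison.
direct = conn.execute("SELECT DISTINCT col1, col2 FROM tbl").fetchall()

print(sorted(wrapped))
print(sorted(direct))
```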

  • 2020-11-22 00:12

    If you put together the answers so far, clean up and improve, you would arrive at this superior query:

    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  (saleprice, saledate) IN (
        SELECT saleprice, saledate
        FROM   sales
        GROUP  BY saleprice, saledate
        HAVING count(*) = 1 
        );
    

    This is much faster than either of them. It beats the currently accepted answer by a factor of 10 to 15 (in my tests on PostgreSQL 8.4 and 9.1).

    But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:

    UPDATE sales s
    SET    status = 'ACTIVE'
    WHERE  NOT EXISTS (
       SELECT FROM sales s1                     -- SELECT list can be empty for EXISTS
       WHERE  s.saleprice = s1.saleprice
       AND    s.saledate  = s1.saledate
       AND    s.id <> s1.id                     -- except for row itself
       )
    AND    s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates. see below
    


    Unique key to identify row

    If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):

       AND    s1.ctid <> s.ctid
    

    Every table should have a primary key. Add one if you don't have one yet. I suggest a serial column, or an IDENTITY column in Postgres 10+.

    Related:

    • In-order sequence generation
    • Auto increment table column

    How is this faster?

    The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
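
The sketch below does not measure speed. It only checks, on invented rows, that the GROUP BY form and the NOT EXISTS form mark the same sales. It runs on SQLite through Python's sqlite3 module, so two details are adapted and are assumptions of this example: the subquery uses SELECT 1 because SQLite does not accept an empty SELECT list, and the outer table is referenced by name instead of through an alias. Row values in IN need SQLite 3.15 or newer.

```python
import sqlite3

# Invented sample data: ids 1 and 2 share the same price and day.
ROWS = [
    (1, 100, "2020-01-01"),
    (2, 100, "2020-01-01"),
    (3, 100, "2020-01-02"),
    (4, 250, "2020-01-02"),
]

GROUP_BY_UPDATE = """
    UPDATE sales SET status = 'ACTIVE'
    WHERE (saleprice, saledate) IN (
        SELECT saleprice, saledate FROM sales
        GROUP BY saleprice, saledate
        HAVING count(*) = 1)
"""

NOT_EXISTS_UPDATE = """
    UPDATE sales SET status = 'ACTIVE'
    WHERE NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE s1.saleprice = sales.saleprice
        AND   s1.saledate  = sales.saledate
        AND   s1.id <> sales.id)
"""


def active_ids(update_sql):
    """Run one UPDATE on a fresh copy of the data and list the ACTIVE ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, "
        "saleprice INTEGER NOT NULL, saledate TEXT NOT NULL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)", ROWS
    )
    conn.execute(update_sql)
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id"
        )
    ]
    conn.close()
    return ids


via_group_by = active_ids(GROUP_BY_UPDATE)
via_not_exists = active_ids(NOT_EXISTS_UPDATE)
print(via_group_by, via_not_exists)
```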

    Exclude empty updates

    For rows that already have status = 'ACTIVE', this update would not change anything but would still write a new row version at full cost (minor exceptions apply). Normally you do not want this. Add another WHERE condition, as in the last line of the query above, to avoid this and make the update even faster.

    If status is defined NOT NULL, you can simplify to:

    AND status <> 'ACTIVE';
    

    The data type of the column must support the <> operator. Some types like json don't. See:

    • How to query a json column for empty objects?
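
The effect of the extra condition can be observed through the number of rows an UPDATE touches. The sketch below uses Python's sqlite3 module and invented rows. SQLite does not create row versions the way PostgreSQL does, so this only illustrates which rows get rewritten, not their cost. As an assumption of this example, SQLite's null-safe IS NOT operator stands in for IS DISTINCT FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, "
    "saleprice INTEGER, saledate TEXT, status TEXT)"
)
# Invented sample data: ids 1 and 2 are duplicates of each other.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [(1, 100, "2020-01-01"), (2, 100, "2020-01-01"),
     (3, 100, "2020-01-02"), (4, 250, "2020-01-02")],
)

BASE = """
    UPDATE sales SET status = 'ACTIVE'
    WHERE NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE s1.saleprice = sales.saleprice
        AND   s1.saledate  = sales.saledate
        AND   s1.id <> sales.id)
"""
GUARDED = BASE + " AND status IS NOT 'ACTIVE'"   # null-safe comparison
PLAIN_NE = BASE + " AND status <> 'ACTIVE'"      # only safe if status is NOT NULL

first_run = conn.execute(BASE).rowcount          # rows 3 and 4 become ACTIVE
repeat_unguarded = conn.execute(BASE).rowcount   # same rows rewritten again
repeat_guarded = conn.execute(GUARDED).rowcount  # nothing left to change

# With status reset to NULL, <> filters out every row, the null-safe form does not.
conn.execute("UPDATE sales SET status = NULL")
null_plain_ne = conn.execute(PLAIN_NE).rowcount
null_guarded = conn.execute(GUARDED).rowcount

print(first_run, repeat_unguarded, repeat_guarded, null_plain_ne, null_guarded)
```

The last two counts show why the simplified `<>` form is only appropriate when status is defined NOT NULL.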

    Subtle difference in NULL handling

    This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):

    (123, NULL)
    (123, NULL)
    

    Such rows also pass a unique index, and almost any other check, since NULL values do not compare equal according to the SQL standard. See:

    • Create unique constraint with null columns

    OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:

    • How to delete duplicate rows without unique identifier

    If all columns being compared are defined NOT NULL, there is no room for disagreement.
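
The difference can be reproduced with the two example rows above. This sketch uses Python's sqlite3 module. The third row and the column types are invented, and SQLite's IS operator is used as a stand-in for IS NOT DISTINCT FROM, which is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 123, None), (2, 123, None), (3, 50, "2020-01-01")],
)


def ids(sql):
    return [row[0] for row in conn.execute(sql)]


# GROUP BY puts the two (123, NULL) rows into one group of size 2.
grouped = ids("""
    SELECT MIN(id) FROM sales
    GROUP BY saleprice, saledate
    HAVING COUNT(*) = 1
    ORDER BY 1
""")

NOT_EXISTS = """
    SELECT id FROM sales s
    WHERE NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE s1.saleprice {op} s.saleprice
        AND   s1.saledate  {op} s.saledate
        AND   s1.id <> s.id)
    ORDER BY id
"""

with_equals = ids(NOT_EXISTS.format(op="="))   # NULL = NULL is not true
null_safe = ids(NOT_EXISTS.format(op="IS"))    # NULL IS NULL is true

print(grouped, with_equals, null_safe)
```

With `=`, both NULL rows count as having no duplicate. With the null-safe comparison, the result matches the GROUP BY behaviour.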

  • 2020-11-22 00:19

    I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using

    Select distinct GrondOfLucht,sortering
    from CorWijzeVanAanleg
    order by sortering
    

    This also returns the column 'sortering', and because the combination of 'GrondOfLucht' and 'sortering' differs from row to row, the result is ALL rows.

    Use GROUP BY to select the values of 'GrondOfLucht' in the order given by 'sortering':

    SELECT   GrondOfLucht
    FROM     dbo.CorWijzeVanAanleg
    GROUP BY GrondOfLucht          -- group on this column only, so each value appears once
    ORDER BY MIN(sortering)
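
The sketch below compares grouping by GrondOfLucht alone with grouping by both columns, on invented rows where one GrondOfLucht value has several sortering values. It uses Python's sqlite3 module, so the dbo. schema prefix is dropped. The sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
# Invented rows: 'Lucht' appears with two different sortering values.
conn.executemany(
    "INSERT INTO CorWijzeVanAanleg VALUES (?, ?)",
    [("Lucht", 1), ("Grond", 2), ("Lucht", 3)],
)

by_one_column = [
    row[0]
    for row in conn.execute("""
        SELECT GrondOfLucht
        FROM CorWijzeVanAanleg
        GROUP BY GrondOfLucht
        ORDER BY MIN(sortering)
    """)
]

by_both_columns = [
    row[0]
    for row in conn.execute("""
        SELECT GrondOfLucht
        FROM CorWijzeVanAanleg
        GROUP BY GrondOfLucht, sortering
        ORDER BY MIN(sortering)
    """)
]

print(by_one_column)
print(by_both_columns)
```

Grouping by GrondOfLucht alone yields each value once, ordered by its smallest sortering. Adding sortering to the GROUP BY brings back one row per distinct pair.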
    