Find possible duplicates in two columns ignoring case and special characters

北海茫月 2021-02-09 05:22

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lowe

3 answers
  •  灰色年华
    2021-02-09 06:22

    lower() / upper()

    Use one of these to fold characters to either lower or upper case. Special characters are not affected:

    SELECT count(*), lower(name), number
    FROM   tbl
    GROUP  BY lower(name), number
    HAVING count(*) > 1;
    

    unaccent()

    If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

    CREATE EXTENSION unaccent;
    

    That makes it very simple:

    SELECT lower(unaccent('Büßercafé')) AS norm
    

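    Result (Postgres 9.5 or older; since 9.6 unaccent() expands ligatures, so ß becomes ss and you get bussercafe):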

    busercafe
    

    This doesn't strip non-letters. Add regexp_replace() for that, as @Craig mentioned. Note that \W matches non-word characters, so digits and underscores are kept:

    SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g'))) AS norm;
    

    Result (Postgres 9.5 or older; since 9.6 unaccent() expands ß to ss, which gives softbussercafe):

    softbusercafe
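
    Applied to the query from the question, the same normalization might look like the sketch below. It assumes the unaccent extension is installed and that tbl has the columns name and number as shown in the question. The aliases dupes, norm_name and original_names are made up for illustration:

    ```sql
    -- Sketch only: group rows by the normalized name instead of the raw name.
    SELECT count(*)                                              AS dupes,
           lower(unaccent(regexp_replace(name, '\W', '', 'g'))) AS norm_name,
           number,
           array_agg(name)                                       AS original_names  -- raw spellings in each group
    FROM   tbl
    GROUP  BY 2, 3          -- positional references to norm_name and number
    HAVING count(*) > 1;
    ```

    The array_agg() column is optional. It only makes it easier to see which original spellings were treated as the same name.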
    

    You can even build a functional index on top of that:

    • Does PostgreSQL support "accent insensitive" collations?
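
    A rough sketch of such an index, under a few assumptions: unaccent() is only declared STABLE, while index expressions must be IMMUTABLE, so a wrapper function is needed. The names f_normalize and tbl_name_norm_idx are hypothetical, and the extension is assumed to be installed in schema public:

    ```sql
    -- Hypothetical wrapper; declared IMMUTABLE so it can be used in an index.
    -- The two-argument form of unaccent() names the dictionary explicitly,
    -- so the result does not depend on the current search_path.
    CREATE OR REPLACE FUNCTION f_normalize(text)
      RETURNS text
      LANGUAGE sql IMMUTABLE STRICT AS
    $func$
    SELECT lower(public.unaccent('public.unaccent', regexp_replace($1, '\W', '', 'g')))
    $func$;

    -- Expression index on the normalized name plus the second column.
    CREATE INDEX tbl_name_norm_idx ON tbl (f_normalize(name), number);
    ```

    Queries have to use the same expression, f_normalize(name), for the planner to consider this index. Because the function is declared IMMUTABLE on trust, the index should be rebuilt if the unaccent rules file ever changes.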
