Find possible duplicates in two columns ignoring case and special characters

后端 未结 3 1015
北海茫月
北海茫月 2021-02-09 05:22

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lowe

相关标签:
3条回答
  • 2021-02-09 05:57

    (Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (dicratics) is generally bogus. It's OK-ish if you're matching data to find out if some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

    The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu" and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form of "Tapo" match what people actually write when forced into the ascii character set. Words with dicratics may also be ascii-ized into a hyphenated word.

    You can see this in English with ligatures, where the word dæmon is ascii-ized daemon. If you stripped the ligature you'd get dmon which wouldn't match daemon, the common spelling. The same is true of æther which is typically ascii-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.

    If you must attempt to "un-accent", "normalize" accents or "strip" accents:

    You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:

    regress=# SELECT regexp_replace(lower(x),'\W','','g') 
              FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
     regexp_replace 
    ----------------
     soft
     café
    (2 rows)
    

    If you want to filter out accented chars too you can define your own character class:

    regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
              FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
     regexp_replace 
    ----------------
     soft
     caf
    (2 rows)
    

    If you actually intended to substitute some accented characters for similar unaccented characters, you could use translate as per this wiki article:

    regress=# SELECT translate(
            lower(x),
            'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
            'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
        )
        FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
    
     translate 
    -----------
     $s^o&f!t
     cafe
    (2 rows)
    
    0 讨论(0)
  • 2021-02-09 06:11

    PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:

    SELECT COUNT(*), lower(name), number FROM TABLE 
    GROUP BY lower(name), number HAVING COUNT(*) > 1
    
    • NOTE: This has not been tested in Postgres
    0 讨论(0)
  • 2021-02-09 06:22

    lower()/ upper()

    Use one of these to fold characters to either lower or upper case. Special characters are not affected:

    SELECT count(*), lower(name), number
    FROM   tbl
    GROUP  BY lower(name), number
    HAVING count(*) > 1;
    

    unaccent()

    If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

    CREATE EXTENSION unaccent;
    

    Makes it very simple:

    SELECT lower(unaccent('Büßercafé')) AS norm
    

    Result:

    busercafe
    

    This doesn't strip non-letters. Add regexp_replace() like @Craig mentioned for that:

    SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ))
                                                                         AS norm
    

    Result:

    softbusercafe
    

    You can even build a functional index on top of that:

    • Does PostgreSQL support "accent insensitive" collations?
    0 讨论(0)
提交回复
热议问题