SELECT COUNT(*), name, number
FROM tbl
GROUP BY name, number
HAVING COUNT(*) > 1
It sometimes fails to find duplicates between lowe
(Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (dicratics) is generally bogus. It's OK-ish if you're matching data to find out if some misguided user or application munged résumé
into resume
, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.
The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language ö
might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu" and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form of "Tapo" match what people actually write when forced into the ascii character set. Words with dicratics may also be ascii-ized into a hyphenated word.
You can see this in English with ligatures, where the word dæmon is ascii-ized daemon
. If you stripped the ligature you'd get dmon
which wouldn't match daemon
, the common spelling. The same is true of æther
which is typically ascii-ized to aether
or ether
. You can also see this in German with ß, typically "expanded" as ss
.
You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W
escape (shorthand for the character class [^[:alnum:]_]
as per the manual) to exclude "symbols" but not accented characters:
regress=# SELECT regexp_replace(lower(x),'\W','','g')
FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
regexp_replace
----------------
soft
café
(2 rows)
If you want to filter out accented chars too you can define your own character class:
regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
regexp_replace
----------------
soft
caf
(2 rows)
If you actually intended to substitute some accented characters for similar unaccented characters, you could use translate
as per this wiki article:
regress=# SELECT translate(
lower(x),
'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
)
FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
translate
-----------
$s^o&f!t
cafe
(2 rows)
PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:
SELECT COUNT(*), lower(name), number FROM TABLE
GROUP BY lower(name), number HAVING COUNT(*) > 1
Use one of these to fold characters to either lower or upper case. Special characters are not affected:
SELECT count(*), lower(name), number
FROM tbl
GROUP BY lower(name), number
HAVING count(*) > 1;
If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent()
:
CREATE EXTENSION unaccent;
Makes it very simple:
SELECT lower(unaccent('Büßercafé')) AS norm
Result:
busercafe
This doesn't strip non-letters. Add regexp_replace()
like @Craig mentioned for that:
SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ))
AS norm
Result:
softbusercafe
You can even build a functional index on top of that: