Find possible duplicates in two columns ignoring case and special characters


Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lowe


3 Answers

星月不相逢

2021-02-09 05:57
(Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (diacritics) is generally bogus. It's OK-ish if you're matching data to find out whether some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and it should be combined with a string-similarity matching system like trigrams or Levenshtein distances.

The idea of "unaccenting" presumes that any accented character has a single valid unaccented equivalent, or at least that any given accented character is replaced with at most one unaccented character in an ASCII-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the ASCII-ized spelling conventions might reflect that. Thus, in one language the correct "un-accenting" of the made-up dummy word "Tapö" might be "Tapu", and in another this imaginary word might be ASCII-ized to "Tapoo". In neither case will the "un-accented" form "Tapo" match what people actually write when forced into the ASCII character set. Words with diacritics may also be ASCII-ized into a hyphenated word.

You can see this in English with ligatures, where the word dæmon is ASCII-ized as daemon. If you stripped the ligature you'd get dmon, which wouldn't match daemon, the common spelling. The same is true of æther, which is typically ASCII-ized to aether or ether. You can also see this in German with ß, typically "expanded" as ss.
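As a rough sketch of the string-similarity matching mentioned above, the query below uses the `pg_trgm` extension to pair up names that are similar but not identical. It assumes the `tbl(name, number)` table from the question; the `0.6` threshold is an arbitrary illustration, not a recommended value.

```sql
-- Assumes the pg_trgm contrib module is available on the server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Pair rows that share the same number and have similar names.
-- a.name < b.name reports each pair once and skips identical names,
-- which the plain GROUP BY query already finds.
SELECT a.name AS name_a,
       b.name AS name_b,
       a.number,
       similarity(lower(a.name), lower(b.name)) AS sim
FROM   tbl a
JOIN   tbl b ON  a.number = b.number
             AND a.name   < b.name
WHERE  similarity(lower(a.name), lower(b.name)) > 0.6  -- hypothetical cutoff
ORDER  BY sim DESC;
```

This is a self-join, so it gets expensive on large tables. The pairs it returns are candidates for manual review rather than confirmed duplicates.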

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:
```
regress=# SELECT regexp_replace(lower(x),'\W','','g') 
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 café
(2 rows)
```
If you want to filter out accented chars too you can define your own character class:
```
regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 caf
(2 rows)
```
If you actually intended to substitute some accented characters for similar unaccented characters, you could use translate as per this wiki article:
```
regress=# SELECT translate(
        lower(x),
        'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
        'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
    )
    FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);

 translate 
-----------
 $s^o&f!t
 cafe
(2 rows)
```
不思量自难忘°

2021-02-09 06:11
String comparisons in PostgreSQL are case-sensitive by default. You can make the grouping case-insensitive by converting all values to a single case:
```
-- "tbl" stands in for the real table name (TABLE is a reserved word)
SELECT COUNT(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING COUNT(*) > 1;
```
- NOTE: This has not been tested in Postgres
灰色年华

2021-02-09 06:22
lower() / upper()

Use one of these to fold characters to either lower or upper case. Special characters are not affected:
```
SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;
```
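To see which original spellings ended up in each group, one option is to collect them with `array_agg()`. This is a sketch against the same `tbl(name, number)` table; `norm_name`, `cnt` and `variants` are illustrative aliases.

```sql
-- Same grouping as above, plus the raw spellings that were folded together.
SELECT lower(name)                   AS norm_name,
       number,
       count(*)                      AS cnt,
       array_agg(name ORDER BY name) AS variants
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;
```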
unaccent()

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():
```
CREATE EXTENSION unaccent;
```
That makes it very simple:
```
SELECT lower(unaccent('Büßercafé')) AS norm
```
Result:
```
busercafe
```
This doesn't strip non-letters. Add regexp_replace() like @Craig mentioned for that:
```
SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ))
                                                                     AS norm
```
Result:
```
softbusercafe
```
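Put together, the duplicate search from the question might look like the sketch below. It assumes the `unaccent` extension is installed and uses the `tbl(name, number)` table from the question; `norm_name` and `cnt` are illustrative aliases.

```sql
-- Normalize: strip non-word characters, remove accents, fold case,
-- then group on the normalized value together with number.
SELECT lower(unaccent(regexp_replace(name, '\W', '', 'g'))) AS norm_name,
       number,
       count(*) AS cnt
FROM   tbl
GROUP  BY 1, 2
HAVING count(*) > 1;
```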
You can even build a functional index on top of that:
- Does PostgreSQL support "accent insensitive" collations?
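One detail matters for such an index: `unaccent()` is declared `STABLE`, not `IMMUTABLE`, so it cannot be used in an index expression directly. A common workaround is an immutable wrapper function. The sketch below assumes the extension lives in schema `public`; the names `f_unaccent` and `tbl_name_norm_idx` are hypothetical.

```sql
-- Wrapper that pins the dictionary explicitly and is declared IMMUTABLE.
-- Assumption: the unaccent extension was created in schema public.
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1)
$func$;

-- Expression index on the normalized name.
CREATE INDEX tbl_name_norm_idx ON tbl (lower(f_unaccent(name)));
```

Queries have to use the same expression, `lower(f_unaccent(name))`, for the planner to consider the index. If the unaccent rules file is changed later, the index should be rebuilt, since the wrapper promises immutability that the dictionary itself does not guarantee.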