Question
I have the table HAVE below. How can I get the results shown in WANT? I'd appreciate any ideas, and I'm open to any fuzzy-matching algorithm out there.
Have
ID Name
1 Davi
2 David
3 DAVID
4 Micheal
5 Michael
6 Oracle
7 Tepper
WANT
ID Name mtch_ind
1 Davi 1
2 David 1
3 DAVID 1
4 Micheal 2
5 Michael 2
6 Oracle 3
7 Tepper 4
Table DDL and record inserts
CREATE TABLE HAVE (
    ID INTEGER,
    Name VARCHAR(10)
);
INSERT INTO HAVE VALUES (1, 'Davi');
INSERT INTO HAVE VALUES (2, 'David');
INSERT INTO HAVE VALUES (3, 'DAVID');
INSERT INTO HAVE VALUES (4, 'Micheal');
INSERT INTO HAVE VALUES (5, 'Michael');
INSERT INTO HAVE VALUES (6, 'Oracle');
INSERT INTO HAVE VALUES (7, 'Tepper');
Answer 1:
Here is an algorithm that I believe should work:
Step-1: Identify the close matches using the Jaro-Winkler similarity with a 75% match threshold:
select h1.name, h2.name, UTL_MATCH.JARO_WINKLER(h1.name, h2.name) as match_confidence
from have h1
join have h2
on UTL_MATCH.JARO_WINKLER(h1.name, h2.name) > 0.75 -- 75% match threshold
Step-2: For each set of similar records, pick the h2.name with the highest match_confidence, or simply take one top row per set.
Step-3: Perform a dense rank operation on the new column to get the result you want.
Hope this works. Note: this is my first post on SO, and I don't have access to Oracle at the moment.
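To make the three steps concrete, here is a minimal sketch in plain Python rather than Oracle SQL. It uses a textbook Jaro-Winkler implementation, which may differ from UTL_MATCH in edge cases. Several details are assumptions of this sketch and not part of the answer: names are lower-cased before comparison, the "one top row" in Step 2 is taken to be the matching row with the smallest ID, and the names `jaro`, `jaro_winkler`, `match_index` and `HAVE` are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1 = [False] * len(s1)
    hit2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters appearing in a different order are half-transpositions.
    seq1 = [c for c, hit in zip(s1, hit1) if hit]
    seq2 = [c for c, hit in zip(s2, hit2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def match_index(rows, threshold=0.75):
    """Return (id, name, mtch_ind) tuples for a list of (id, name) rows."""
    # Steps 1 and 2: for each row, find every row above the threshold and
    # keep the smallest matching ID as the representative (assumed rule).
    reps = {}
    for id1, name1 in rows:
        reps[id1] = min(
            id2 for id2, name2 in rows
            if jaro_winkler(name1.lower(), name2.lower()) > threshold
        )
    # Step 3: dense rank over the distinct representatives.
    rank = {rep: n for n, rep in enumerate(sorted(set(reps.values())), start=1)}
    return [(id_, name, rank[reps[id_]]) for id_, name in rows]


HAVE = [(1, "Davi"), (2, "David"), (3, "DAVID"), (4, "Micheal"),
        (5, "Michael"), (6, "Oracle"), (7, "Tepper")]

for row in match_index(HAVE):
    print(row)
```

On the sample data this groups IDs 1 to 3 together, IDs 4 and 5 together, and leaves Oracle and Tepper on their own. Picking the smallest matching ID is not a full clustering: if A matches B and B matches C but A does not match C, the rows can end up with different representatives, so longer chains of near-matches would need extra handling.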
Answer 2:
While this solution is a bit ugly, I came up with the approach below. FYI, it's best to first convert the uppercase DAVID to David, since the comparison is case-sensitive. Hopefully someone will find this useful or come up with a better solution. Thanks
with table1 as (
    -- Pair every row with every row whose similarity score exceeds 75
    -- (each row also pairs with itself), then order the IDs within each pair.
    SELECT ROW_NUMBER() OVER (ORDER BY firstID) as rowno, A.*
    FROM (
        select
            t1.name
          , t1.ID
          , case when t1.ID > t1.Fid then t1.Fid else t1.ID end as FIRSTID
          , case when t1.ID > t1.Fid then t1.ID else t1.Fid end as SECONDID
            -- Note: FIRSTNAME is the name of SECONDID, SECONDNAME is the name of FIRSTID
          , case when t1.ID > t1.Fid then t1.NAME else t1.FNAME end as FIRSTNAME
          , case when t1.ID > t1.Fid then t1.FNAME else t1.NAME end as SECONDNAME
            -- A row that matched only itself has no near-duplicates
          , case when count(*) over (partition by id) = 1 then 'nodups' else 'dups' end as ID_chk
        from (
            SELECT h1.NAME,
                   h1.ID,
                   h2.id as Fid,
                   h2.name as Fname,
                   SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(h1.name, h2.name) as match1
            FROM HAVE h1, HAVE h2
            where SYS.UTL_MATCH.JARO_WINKLER_SIMILARITY(h1.name, h2.name) > 75
            order by h1.id
        ) t1
    ) A
)
, no_dups as (
    select * from table1 where ID_chk = 'nodups'
)
, dups as (
    select * from table1 where ID_chk <> 'nodups'
)
, dups_stp1 as (
    -- Drop the self-pairs
    select * from dups
    WHERE FIRSTID <> SECONDID
)
, dups_stp2 as (
    -- Keep only pairs whose FIRSTID is never a SECONDID,
    -- i.e. FIRSTID is the lowest ID among its matches
    select rowno, ID, FIRSTID, SECONDNAME from dups_stp1
    where FIRSTID not in (select SECONDID from dups_stp1)
)
select t2.ID, t3.NAME, rnk as mtch_ind
from (
    select ID, SECONDNAME as NAME,
           dense_rank() OVER (ORDER BY SECONDNAME asc) as rnk
    from (
        select distinct ID, FIRSTID, SECONDNAME from dups_stp2
        union all
        select ID, FIRSTID, SECONDNAME from no_dups
    ) t1
) t2
inner join HAVE t3 on t2.ID = t3.ID
;
Reference: https://www.decisivedata.net/blog/cleaning-messy-data-sql-part-1-fuzzy-matching-names
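The advice at the top of this answer about converting DAVID to David matters because string-similarity scores are generally case-sensitive. The small Python sketch below illustrates the effect. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, which is not Jaro-Winkler and will not reproduce Oracle's scores; the helper name `similarity` is made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity ratio in [0, 1]; NOT Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


# Compared as-is, only the leading "D" is shared between the two strings.
raw = similarity("David", "DAVID")

# After normalizing the case, the two strings are identical.
normalized = similarity("David".lower(), "DAVID".lower())

print(f"raw={raw:.2f} normalized={normalized:.2f}")
```

In Oracle, the analogous step would presumably be to wrap both arguments of the similarity call in UPPER(...) or LOWER(...), so that the stored data does not need to be changed.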
Source: https://stackoverflow.com/questions/61599416/match-slightly-different-records-in-a-field