Question
I have two databases that both contain company names, but in different formats. I have been able to do exact matching using VLOOKUP. I now want to find companies that are written differently but are actually the same company, and extract their data.
Below is a small part of the databases I have:
Database 1
Column A | Column B
1-800-Flowers.com Inc | 1234 (data I need to grab)
Abbott Laboratories (Abbott) | 4567
21st Century Fox America Inc (formerly News America Inc) | 8910

Database 2
Column C | Column D
1-800 CONTACTS INC | ABCD (data I can ignore, as the company doesn't exist in Database 1)
1-800-FLOWERS.COM | EFGH (data I need, as it matches Database 1)
ABBOTT LABORATORIES | IJK
TWENTY-FIRST CENTURY FOX INC | LMNO
As you can see from the samples above, entries in Database 1 match entries in Database 2 through similar words, for example 21st Century Fox America Inc vs. Twenty-first Century Fox Inc.
Database 1 has about 4,000+ values, while Database 2 has 10,000 values. Is there code that can compare similar words between both databases and extract the data I need from columns B and D?
I have tried query, but it doesn't work the way I wanted it to. This is my shareable link.
Currently, I extract the words the names have in common using REGEXEXTRACT (for example, Century Fox in 21st Century Fox and Twenty-First Century Fox) and then try to match both data sets using QUERY. However, my query returns #N/A when I write it like this:
=query(E:E,"Select E where E contains '"&L2&"'",0 )
where L2 is the cell that contains the string Century Fox.
Answer 1:
L2:
=ARRAYFORMULA(INDEX($E$2:$E$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0)))
M2:
=ARRAYFORMULA(INDEX($E$2:$F$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0),2))
N2:
=ARRAYFORMULA(TEXT(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7))/LEN(A2),"0%"))
Drag fill down.
Notes:
The formula is resource intensive. Apps Script might be a better choice.
For the given sample, this formula works with a reasonable degree of precision.
7 is the maximum number of words per cell found in all of column E (column C of Database 2). It is hardcoded in the formulas above, but it should be found using a helper column: put =COUNTA(SPLIT(E2," ")) in Z2 and drag fill down, then put =MAX(Z2:Z) in AA2.
Column N gives the degree of confidence in the VLOOKUP-produced result: the total length of the matched words divided by the length of the name in column A. Preferably, anything below 45% should be rechecked manually.
How it works: every entry in column E (db2) is split into words, and each word is looked up among the words of the column A (db1) entry on the formula's row. For each column E entry, the lengths of the matched words are added up, and the entry with the largest total is returned as the probable match. A letter-based approach instead of a word-based approach may give better precision, but seems unnecessary for the given sample.
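The word-matching idea described above can be sketched outside the spreadsheet. The following is a minimal Python illustration, not a port of the formulas: the function names (`word_overlap_score`, `best_match`, `char_similarity`) and the `DB2_SAMPLE` list are hypothetical, matching is assumed to be case-insensitive (as VLOOKUP is), and the character-based variant swaps in `difflib.SequenceMatcher` from the standard library as one possible "letter approach".

```python
from difflib import SequenceMatcher

# Hypothetical sample rows taken from the question: (company name, data).
DB2_SAMPLE = [
    ("1-800 CONTACTS INC", "ABCD"),
    ("1-800-FLOWERS.COM", "EFGH"),
    ("ABBOTT LABORATORIES", "IJK"),
    ("TWENTY-FIRST CENTURY FOX INC", "LMNO"),
]


def word_overlap_score(db1_name, db2_name):
    """Sum the lengths of the db2 words that also occur in db1_name."""
    # Lower-casing is an assumption that mirrors case-insensitive lookups.
    db1_words = {w.lower() for w in db1_name.split(" ") if w}
    return sum(len(w) for w in db2_name.split(" ") if w.lower() in db1_words)


def best_match(db1_name, db2_rows):
    """Return (db2 name, db2 data, confidence) for the highest-scoring row.

    Confidence is matched length divided by len(db1_name), like column N.
    On ties, max() keeps the first row, as MATCH(..., 0) would.
    """
    name, data = max(db2_rows, key=lambda row: word_overlap_score(db1_name, row[0]))
    confidence = word_overlap_score(db1_name, name) / len(db1_name)
    return name, data, confidence


def char_similarity(a, b):
    """Character-based alternative: similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    for db1_name in ("1-800-Flowers.com Inc", "Abbott Laboratories (Abbott)"):
        name, data, confidence = best_match(db1_name, DB2_SAMPLE)
        print(f"{db1_name} -> {name} ({data}), confidence {confidence:.0%}")
```

As in the spreadsheet version, a low confidence value only flags a row for manual review; punctuation attached to a word (such as a trailing parenthesis) prevents that word from matching, so cleaning the names first would likely improve results.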
Source: https://stackoverflow.com/questions/48482798/google-sheets-matching-company-names