Question
I have a table of company names. It contains many duplicates caused by human input errors: people disagree on whether the subdivision should be included, there are typos, and so on. I want all of these duplicates to be marked as one company, "1c":
+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+
I identified Levenshtein distance as a good way to eliminate typos. However, when a subdivision is appended, the Levenshtein distance increases dramatically (every extra character counts as one edit), so it no longer seems to be a good algorithm for this. Is this correct?
In general, I have barely any experience in computational linguistics, so I am at a loss as to which methods I should choose.
What algorithms would you recommend for this problem? I want to implement it in Java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.
Answer 1:
This is a difficult problem. A magic search keyword that might help you is "normalization". The term sometimes means very different things ("database normalization" is unrelated, for example), but you are effectively trying to normalize your input here.
A simple solution is to use Levenshtein distance with token awareness. The Python library FuzzyWuzzy does this, and this blog post introduces how it works with motivating examples. The basic idea is simple enough that you should be able to implement it in Java without much difficulty.
At a high level, the idea is to split the input into tokens on whitespace (and maybe punctuation), sort the tokens and treat them as a set, and then use the size of the set intersection as a metric, where two tokens count as equal if they match fuzzily (for example, within a small edit distance).
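To make the idea concrete, here is a minimal Java sketch of it. It is not a port of FuzzyWuzzy. The class name `CompanyNameMatcher`, the `maxEdits` threshold, and the choice to score by the fraction of the smaller token set that finds a fuzzy match are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; names and scoring rule are illustrative assumptions.
class CompanyNameMatcher {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Lower-cases the name and splits it on anything that is not a letter or digit. */
    static Set<String> tokenize(String name) {
        Set<String> tokens = new TreeSet<>(); // sorted set of distinct tokens
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts the tokens of a that have a match in b within maxEdits edits. */
    static int fuzzyIntersectionSize(Set<String> a, Set<String> b, int maxEdits) {
        int count = 0;
        for (String x : a) {
            for (String y : b) {
                if (levenshtein(x, y) <= maxEdits) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    /** Fraction (0.0 to 1.0) of the smaller token set that is fuzzily matched in the larger one. */
    static double similarity(String name1, String name2, int maxEdits) {
        Set<String> a = tokenize(name1);
        Set<String> b = tokenize(name2);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> small = a.size() <= b.size() ? a : b;
        Set<String> large = (small == a) ? b : a;
        return (double) fuzzyIntersectionSize(small, large, maxEdits) / small.size();
    }

    public static void main(String[] args) {
        String[] names = {"1c company", "1c-avalon", "1c: maddox games", "1cc games", "ubisoft"};
        for (String n : names) {
            System.out.println(n + " -> " + similarity("1c", n, 1));
        }
    }
}
```

Plain Levenshtein distance between "1c" and "1c game studios" is 13, whereas this token-based score treats the two as a full match because the single token "1c" appears in both. Note that a fixed `maxEdits` is crude for very short tokens (with a threshold of 1, "1c" would also match "2c"), so in practice you would probably scale the threshold with token length or require exact matches for short tokens.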
Some related links:
- Are there any good libraries available for doing normalization of company names? - Open Data Stack Exchange
- NEMO: Extraction and normalization of organization names from PubMed affiliation strings
- Automatic gazetteer enrichment with user-geocoded data - For place names, this basically creates a list of "true" names and then uses fuzzy lookup.
- Normalizing company names with SPARQL and DBpedia - bobdc.blog - Uses Wikipedia redirect information.
Source: https://stackoverflow.com/questions/44725930/duplicate-elimination-of-similar-company-names