Fuzzy matching deduplication in less than exponential time?
Question: I have a large database (potentially millions of records) of relatively short text strings (street addresses, names, and so on). I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database, whereas I want to deduplicate the entire database at once. The former would be a linear time problem (comparing a value against