strategies for finding duplicate mailing addresses

后端 未结 6 1517
悲哀的现实
悲哀的现实 2021-02-10 02:08

I\'m trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = \'# 3 FAIRMONT LIN         


        
6条回答
  •  长发绾君心
    2021-02-10 02:34

    I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.

    What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.

    It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.

    Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.

    Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).

提交回复
热议问题