Strategies for finding duplicate mailing addresses

悲哀的现实 2021-02-10 02:08

I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LIN


        
6 Answers
  •  清酒与你
    2021-02-10 02:27

    I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.

    You'll want to use a Levenshtein implementation written in C rather than in pure Python if you have a large data set. Mine had a few tens of thousands of addresses and took the better part of a day to run, I think.
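
    A minimal sketch of that approach, for illustration only: lowercase each address, compute the Levenshtein distance for every pair, and sort with the closest pairs first. The function names (`levenshtein`, `normalize`, `rank_pairs`) and the sample addresses are made up for this example, and collapsing whitespace during normalization is an extra assumption on top of the lowercasing described above.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(address: str) -> str:
    # Lowercase as in the answer; collapsing whitespace is an added assumption.
    return " ".join(address.lower().split())


def rank_pairs(addresses):
    """Return (distance, addr_a, addr_b) for every pair, closest pairs first."""
    cleaned = [normalize(a) for a in addresses]
    scored = [
        (levenshtein(cleaned[i], cleaned[j]), addresses[i], addresses[j])
        for i, j in combinations(range(len(addresses)), 2)
    ]
    return sorted(scored, key=lambda item: item[0])


# Hypothetical sample data, not taken from the question.
SAMPLE = [
    "12 Elm Street, Springfield",
    "12 ELM STREET   SPRINGFIELD",
    "98 Oak Avenue, Shelbyville",
]

if __name__ == "__main__":
    for distance, left, right in rank_pairs(SAMPLE):
        print(distance, repr(left), repr(right))
```

    For n addresses this makes n(n-1)/2 comparisons, which is why the run time grows so quickly. Swapping the pure-Python `levenshtein` for a C-backed implementation (the rapidfuzz package is one option) should cut the cost of each comparison, but not the number of comparisons.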
