strategies for finding duplicate mailing addresses

悲哀的现实 2021-02-10 02:08

I'm trying to come up with a method for finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LIN


        
6 Answers
  •  鱼传尺愫
    2021-02-10 02:44

    First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

    adr = " ".join(adr.lower().split())
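    As a quick check of this normalization step, here is a minimal sketch. The function name normalize and the sample address are made up for illustration; neither comes from the question.

```python
def normalize(adr: str) -> str:
    """Lower-case the address and collapse each whitespace run to one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing blanks, so joining with " " yields single spaces.
    return " ".join(adr.lower().split())


# Hypothetical input with mixed case, a tab, and repeated spaces.
print(normalize("  12   Example\tRD  "))  # prints: 12 example rd
```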
    

    Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

    import re
    # Raw strings keep \b as a regex word boundary instead of a backspace.
    adr = re.sub(r"1st(\b|$)", r'1', adr)
    adr = re.sub(r"([2-9])\s?nd(\b|$)", r'\1', adr)
    

    Note that the second sub() also matches when there is a space between the "2" and the "nd". I didn't set up the first one to do that, because I'm not sure how you can tell the difference between "41 St Ave" and "41 St" (the latter is "41 Street" abbreviated).
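    To cover "rd" and "th" as well, one possible generalization (not part of the two substitutions above) is a single pattern that only strips a suffix attached directly to the digits. It deliberately leaves "41 st" alone because of the ambiguity just described. The name strip_ordinals is hypothetical, and the sketch assumes the address has already been lower-cased.

```python
import re

# Digits immediately followed by an ordinal suffix, e.g. "41st", "42nd".
_ORDINAL = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")


def strip_ordinals(adr: str) -> str:
    """Drop ordinal suffixes glued to a number; expects lower-cased input."""
    return _ORDINAL.sub(r"\1", adr)


print(strip_ordinals("41st street"))  # prints: 41 street
print(strip_ordinals("41 st ave"))    # unchanged: 41 st ave
```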

    Be sure to read all the help for the re module; it's powerful but cryptic.

    Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

    http://en.wikipedia.org/wiki/Soundex

    http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

    adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
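    Note that soundex() is not in the Python standard library, so it has to come from somewhere, such as the page linked above. In case that link is unavailable, here is a minimal sketch of the common American Soundex variant described in the Wikipedia article. Treat the choice of variant as an assumption; this is not the code from the linked page.

```python
# Consonant groups and their Soundex digits; vowels, y, h and w get no digit.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character American Soundex code, or "" if no letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, then truncate to 4


adr = "41 fairmont strete"  # hypothetical address with a misspelling
adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
print(adrlist)  # "strete" gets the same code as "street"
```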
    

    Then you can work with the list or join it back to a string as you think best.
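    One way to turn the resulting lists into a similarity score is difflib.SequenceMatcher from the standard library. This is only a sketch of one option. The token lists below are hypothetical stand-ins for the output of the previous step, and the cut-off for calling two addresses duplicates is something you would have to tune on your own data.

```python
from difflib import SequenceMatcher


def similarity(tokens_a: list, tokens_b: list) -> float:
    """Return a score in [0, 1]; 1.0 means the token lists are identical."""
    # SequenceMatcher accepts any sequences of hashable items, so it can
    # compare lists of tokens directly instead of raw character strings.
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()


# Hypothetical token lists (numbers kept as-is, words replaced by codes).
a = ["3", "F655", "L520"]
b = ["3", "F655", "L520"]
c = ["3", "F655", "S363"]

print(similarity(a, b))  # identical lists score 1.0
print(similarity(a, c))  # only two of three tokens match, so the score is lower
```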

    The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.

    Good luck.
