Compare 5000 strings with PHP Levenshtein

前端 未结 8 1678
长情又很酷
长情又很酷 2021-01-30 12:15

I have 5000, sometimes more, street address strings in an array. I\'d like to compare them all with levenshtein to find similar matches. How can I do this without looping throug

8条回答
  •  借酒劲吻你
    2021-01-30 12:35

    Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

    First of all, you should normalize addressess, get rid of abbreviations etc.

    • Ave -> Avenue
    • Rd. -> Road

    You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

    If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

    There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

提交回复
热议问题