Compare 5000 strings with PHP Levenshtein

前端 未结 8 1673
长情又很酷
长情又很酷 2021-01-30 12:15

I have 5000, sometimes more, street address strings in an array. I\'d like to compare them all with levenshtein to find similar matches. How can I do this without looping throug

相关标签:
8条回答
  • 2021-01-30 12:28

    I think a better way to group similar addresses would be to:

    1. create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)

    2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

    3. split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

    4. for each address (with id $target) find the most similar addresses:

      SELECT similar.id, similar.address, count(*) 
      FROM adress similar, word cmp, word src
      WHERE src.address_id=$target
      AND src.soundex=cmp.soundex
      AND cmp.address_id=similar.id
      ORDER BY count(*)
      LIMIT $some_value;
      
    5. calculate the levenstein difference between your source address and the top few values returned by the query.

    (doing any sort of operation on large arrays is often faster in databases)

    0 讨论(0)
  • 2021-01-30 12:30

    You can use a bk-tree to speed-up the search/comparison.

    http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

    Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
    [...]
    Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
    [...]
    Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

    edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.

    0 讨论(0)
  • 2021-01-30 12:31

    You could group them based on soundexes then limit the comparisons to the nearest N cases...

     $mashed=array();
     foreach ($address as $key=>$val) {
          $mashed[$key]=soundex($val);
     }
     sort($mashed);
    

    Then iterate through the keys of $mashed.

    C.

    0 讨论(0)
  • 2021-01-30 12:35

    Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.

    You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.

    As a (quite possibly irrelevant) option, you could always use something like soundex which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)

    0 讨论(0)
  • 2021-01-30 12:35

    Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

    First of all, you should normalize addressess, get rid of abbreviations etc.

    • Ave -> Avenue
    • Rd. -> Road

    You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

    If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

    There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

    0 讨论(0)
  • 2021-01-30 12:37

    If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

    $results = array();
    $count = count($entries);
    while ($count != 0) {
        # The entry to process
        $entry = array_shift($entries);
        # Get levenshtein distances to all others
        $result = array_map(
            'levenshtein',
            # array_map() needs two arrays, this one is an array consisting of
            # multiple entries of the value that we are processing
            array_fill($entry, 0, $count),
            $toCompare
        );
        $results[] = array($entry => $result);
        $count--;
    }
    
    0 讨论(0)
提交回复
热议问题