Longest Common Substring with wrong character tolerance

纵然是瞬间 提交于 2019-12-07 11:38:55

问题


I have a script I found on here that works well when looking for the Lowest Common Substring.

However, I need it to tolerate some incorrect/missing characters. I would like be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable.

For example, I want to find this string:

big yellow school bus

inside of this string:

they rode the bigyellow schook bus that afternoon

This is the code i'm currently using:

function longest_common_substring($words) {
    $words = array_map('strtolower', array_map('trim', $words));
    $sort_by_strlen = create_function('$a, $b', 'if (strlen($a) == strlen($b)) { return strcmp($a, $b); } return (strlen($a) < strlen($b)) ? -1 : 1;');
    usort($words, $sort_by_strlen);

    // We have to assume that each string has something in common with the first
    // string (post sort), we just need to figure out what the longest common
    // string is. If any string DOES NOT have something in common with the first
    // string, return false.
    $longest_common_substring = array();
    $shortest_string = str_split(array_shift($words));

    while (sizeof($shortest_string)) {
        array_unshift($longest_common_substring, '');
        foreach ($shortest_string as $ci => $char) {
            foreach ($words as $wi => $word) {
                if (!strstr($word, $longest_common_substring[0] . $char)) {
                    // No match
                    break 2;
                }
            }
            // we found the current char in each word, so add it to the first longest_common_substring element,
            // then start checking again using the next char as well
            $longest_common_substring[0].= $char;
        }
        // We've finished looping through the entire shortest_string.
        // Remove the first char and start all over. Do this until there are no more
        // chars to search on.
        array_shift($shortest_string);
    }

    // If we made it here then we've run through everything
    usort($longest_common_substring, $sort_by_strlen);

    return array_pop($longest_common_substring);
}

Any help is much appreciated.

UPDATE

The PHP levenshtein function is limited to 255 characters, and some of the haystacks i'm searching are 1000+ characters.


回答1:


Writing this as a second answer because it's not based on my previous (bad) one at all.

This code is based on http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm and http://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms

It returns one (of potentially several) minimum-levenshtein substrings of $haystack, given $needle. Now, levenshtein distance is just one measure of edit distance and it may not actually suit your needs. 'hte' is closer on this metric to 'he' than it is to 'the'. Some of the examples I put in show the limitations of this technique. I believe this to be considerably more reliable than the previous answer I gave, but let me know how it works for you.

// utility function - returns the key of the array minimum
function array_min_key($arr)
{
    $min_key = null;
    $min = PHP_INT_MAX;
    foreach($arr as $k => $v) {
        if ($v < $min) {
            $min = $v;
            $min_key = $k;
        }
    }
    return $min_key;
}

// Calculate the edit distance between two strings
function edit_distance($string1, $string2)
{
    $m = strlen($string1);
    $n = strlen($string2);
    $d = array();

    // the distance from '' to substr(string,$i)
    for($i=0;$i<=$m;$i++) $d[$i][0] = $i;
    for($i=0;$i<=$n;$i++) $d[0][$i] = $i;

    // fill-in the edit distance matrix
    for($j=1; $j<=$n; $j++)
    {
        for($i=1; $i<=$m; $i++)
        {
            // Using, for example, the levenshtein distance as edit distance
            list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$string1,$string2);
            $d[$i][$j] = $d[$p_i][$p_j]+$cost;
        }
    }

    return $d[$m][$n];
}

// Helper function for edit_distance()
function levenshtein_weighting($i,$j,$d,$string1,$string2)
{
    // if the two letters are equal, cost is 0
    if($string1[$i-1] === $string2[$j-1]) {
        return array($i-1,$j-1,0);
    }

    // cost we assign each operation
    $cost['delete'] = 1;
    $cost['insert'] = 1;
    $cost['substitute'] = 1;

    // cost of operation + cost to get to the substring we perform it on
    $total_cost['delete'] = $d[$i-1][$j] + $cost['delete'];
    $total_cost['insert'] = $d[$i][$j-1] + $cost['insert'];
    $total_cost['substitute'] = $d[$i-1][$j-1] + $cost['substitute'];

    // return the parent array keys of $d and the operation's cost
    $min_key = array_min_key($total_cost);
    if ($min_key == 'delete') {
        return array($i-1,$j,$cost['delete']);
    } elseif($min_key == 'insert') {
        return array($i,$j-1,$cost['insert']);
    } else {
        return array($i-1,$j-1,$cost['substitute']);
    }
}

// attempt to find the substring of $haystack most closely matching $needle
function shortest_edit_substring($needle, $haystack)
{
    // initialize edit distance matrix
    $m = strlen($needle);
    $n = strlen($haystack);
    $d = array();
    for($i=0;$i<=$m;$i++) {
        $d[$i][0] = $i;
        $backtrace[$i][0] = null;
    }
    // instead of strlen, we initialize the top row to all 0's
    for($i=0;$i<=$n;$i++) {
        $d[0][$i] = 0;
        $backtrace[0][$i] = null;
    }

    // same as the edit_distance calculation, but keep track of how we got there
    for($j=1; $j<=$n; $j++)
    {
        for($i=1; $i<=$m; $i++)
        {
            list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$needle,$haystack);
            $d[$i][$j] = $d[$p_i][$p_j]+$cost;
            $backtrace[$i][$j] = array($p_i,$p_j);
        }
    }

    // now find the minimum at the bottom row
    $min_key = array_min_key($d[$m]);
    $current = array($m,$min_key);
    $parent = $backtrace[$m][$min_key];

    // trace up path to the top row
    while(! is_null($parent)) {
        $current = $parent;
        $parent = $backtrace[$current[0]][$current[1]];
    }

    // and take a substring based on those results
    $start = $current[1];
    $end = $min_key;
    return substr($haystack,$start,$end-$start);
}

// some testing
$data = array( array('foo',' foo'), array('fat','far'), array('dat burn','rugburn'));
$data[] = array('big yellow school bus','they rode the bigyellow schook bus that afternoon');
$data[] = array('bus','they rode the bigyellow schook bus that afternoon');
$data[] = array('big','they rode the bigyellow schook bus that afternoon');
$data[] = array('nook','they rode the bigyellow schook bus that afternoon');
$data[] = array('they','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');
$data[] = array('controker','console, controller and games are all in very good condition, only played occasionally. includes power cable, controller charge cable and audio cable. smoke free house. pes 2011 super street fighter');

foreach($data as $dat) {
    $substring = shortest_edit_substring($dat[0],$dat[1]);
    $dist = edit_distance($dat[0],$substring);
    printf("Found |%s| in |%s|, matching |%s| with edit distance %d\n",$substring,$dat[1],$dat[0],$dist);
}


来源:https://stackoverflow.com/questions/12742348/longest-common-substring-with-wrong-character-tolerance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!