Using Jaro-Winkler, is distance between A and B the same as B and A?

北城以北 submitted on 2019-12-13 03:19:22

Question


I'm using the following class to calculate the Jaro-Winkler distance between two strings. I've noticed that the distance calculated between string A and string B is not always the same as the distance between string B and string A. Is this to be expected?

RAMADI ~ TRADING
0.73492063492063

TRADING ~ RAMADI
0.71825396825397

Demo


Answer 1:


It turns out there is a bug in the PHP versions of the Jaro-Winkler string comparison method found in many places online.

Currently, comparing string A to string B can yield a different result from comparing string B to string A when a character common to both strings occurs more than once in one of them. This is incorrect. The Jaro-Winkler method should yield the same match value for A compared to B as for B compared to A.
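The two scores in the question are consistent with this. As a rough check, assume the class uses the usual Jaro formula, (m1/len1 + m2/len2 + (m1 - t)/m1) / 3, and that the Winkler prefix bonus is zero because the two strings start with different letters. If the repeated A in RAMADI is counted twice, one direction finds five common characters ("RAADI") and the other finds four ("RADI"), with one transposition. The helper name `jaroFromCounts` is hypothetical and only redoes the arithmetic; it is not part of the original class.

```php
<?php
// Hypothetical helper: Jaro score from counts that were worked out by hand.
// $common1 / $common2 are the common-character counts seen from each side,
// $transpositions is the half-count of mismatched positions.
function jaroFromCounts(int $len1, int $len2, int $common1, int $common2, float $transpositions): float
{
    return ($common1 / $len1
          + $common2 / $len2
          + ($common1 - $transpositions) / $common1) / 3.0;
}

// RAMADI ~ TRADING: 5 common characters from RAMADI's side, 4 from TRADING's
$ab = jaroFromCounts(6, 7, 5, 4, 1.0);

// TRADING ~ RAMADI: the counts swap, so the last term changes from 4/5 to 3/4
$ba = jaroFromCounts(7, 6, 4, 5, 1.0);

printf("%.6f\n%.6f\n", $ab, $ba); // prints 0.734921 and 0.718254
```

Only the third term depends on which string comes first, which is where the difference between the two scores comes from.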

To rectify this, the same character should not be repeated when identifying the common characters. The common characters variable needs to be deduplicated before it is returned.

The code below replaces the common characters string with an array that uses the common character as the key, to avoid duplication. With this change, A compared to B yields the same result as B compared to A.

This is in line with the C# version of the method.

//$commonCharacters='';
# The Common Characters variable must be an array
$commonCharacters = [];
for( $i=0; $i < $str1_len; $i++){
    $noMatch = True;
    // compare if char does match inside given allowedDistance
    // and if it does add it to commonCharacters
    for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++) {
        if( $temp_string2[(int)$j] == $string1[$i] ){ // MJR
            $noMatch = False;
            //$commonCharacters .= $string1[$i];
            # The Common Characters array uses the character as a key to avoid duplication.
            $commonCharacters[$string1[$i]] = $string1[$i];
            $temp_string2[(int)$j] = "\0"; // MJR: NUL placeholder marks the position as used; PHP does not accept '' as a string-offset value
        }
    }
}
//return $commonCharacters;
# When returning, turn the array back to a string, as expected
return implode("", $commonCharacters);
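The fragment above can be tried in isolation with a minimal, self-contained sketch like the following. The function name `commonCharactersDeduplicated`, its signature, and the allowed distance of half the shorter string's length are assumptions, since the original class is not shown here. The sketch also copies the second string into an array so that used positions can be marked without writing to string offsets.

```php
<?php
// Sketch of the deduplicating variant described above (names are assumptions).
function commonCharactersDeduplicated(string $string1, string $string2, int $allowedDistance): string
{
    $str1_len = strlen($string1);
    $str2_len = strlen($string2);
    // Array copy of the second string, so matched positions can be blanked.
    $temp_string2 = str_split($string2);
    $commonCharacters = [];

    for ($i = 0; $i < $str1_len; $i++) {
        $noMatch = true;
        $upper = min($i + $allowedDistance + 1, $str2_len);
        // Look for $string1[$i] within the allowed distance in the second string.
        for ($j = max(0, $i - $allowedDistance); $noMatch && $j < $upper; $j++) {
            if ($temp_string2[$j] === $string1[$i]) {
                $noMatch = false;
                // Keyed by character, so a repeated character is stored once.
                $commonCharacters[$string1[$i]] = $string1[$i];
                $temp_string2[$j] = null; // mark this position as used
            }
        }
    }

    // Turn the array back into a string, as the caller expects.
    return implode('', $commonCharacters);
}

// Assumed allowed distance: half the length of the shorter string, rounded down.
$distance = intdiv(min(strlen('RAMADI'), strlen('TRADING')), 2);

$ab = commonCharactersDeduplicated('RAMADI', 'TRADING', $distance);
$ba = commonCharactersDeduplicated('TRADING', 'RAMADI', $distance);

var_dump($ab === $ba); // bool(true): both directions give "RADI"
```

Because the array is keyed by character, genuine repeats are collapsed as well: comparing "AABB" with itself returns "AB" from this helper, so scores for strings with repeated letters may differ from those of other implementations.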


Source: https://stackoverflow.com/questions/57053773/using-jaro-winkler-is-distance-between-a-and-b-the-same-as-b-and-a
