Cosine similarity vs Hamming distance [closed]

后端未结

关注

 4  1215

旧巷少年郎

相关标签:

4条回答

天涯浪人

2020-12-04 14:09
A Hamming distance should be done between two strings of equal length and with the order taken into account.

As your documents are certainly of different length and if the words places do not count, cosine similarity is better (please note that depending your needs, better solutions exist). :)

Here is a cosine similarity function of 2 arrays of words:
```
function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
```
It is fast (isset() instead of in_array() is a killer on large arrays).

As you can see, the results does not take into account the "magnitude" of each the word.

I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)

The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting readings:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298
0 讨论(0)
发布评论:

提交评论
- 加载中...

终归单人心

2020-12-04 14:12

Here my corrected code for Cosine Distance function posted by Toto

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += pow($x,2);
        $c += pow($y,2);
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

0 讨论(0)

一个人的身影

2020-12-04 14:23
Unless I'm mistaken, I think you've got an algorithm halfway between the two algorithms. For Hamming distance, use:
```
function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += 1;
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}
```
(Note that you're only adding 1 for each matched element in the token vectors.)

And for cosine similarity, use:
```
function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term] * $counts2[$term];
    }
    return $totalScore / (count($terms1) * count($terms2));
}
```
(Note that you're adding the product of the token counts between the two documents.)

The main difference between the two is that cosine similarity will yield a stronger indicator when two documents have the same word multiple times in the documents, while Hamming distance doesn't care how often the individual tokens come up.

Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity - as function words are quite frequent (in English, at least), you might skew a result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.

Whichever way you go, good luck!
0 讨论(0)
发布评论:

提交评论
- 加载中...

时光说笑

2020-12-04 14:26

I apologize for ignoring the fact that you said that you didn't want to use any other algorithms, but seriously, Levenshtein distance and Damerau-Levenshtein distance are way more freakin' useful than Hamming distance. Here's a D-L distance implementation in PHP, and if you don't like PHP's native levenshtein() function, which I think you won't because it has a length limit, here's a non-length-limited version:

function levenshtein_distance($text1, $text2) {
    $len1 = strlen($text1);
    $len2 = strlen($text2);
    for($i = 0; $i <= $len1; $i++)
        $distance[$i][0] = $i;
    for($j = 0; $j <= $len2; $j++)
        $distance[0][$j] = $j;
    for($i = 1; $i <= $len1; $i++)
        for($j = 1; $j <= $len2; $j++)
            $distance[$i][$j] = min($distance[$i - 1][$j] + 1, $distance[$i][$j - 1] + 1, $distance[$i - 1][$j - 1] + ($text1[$i - 1] != $text2[$j - 1]));
    return $distance[$len1][$len2];
}

0 讨论(0)

热议问题