Unless I'm mistaken, what you have is halfway between the two algorithms. For Hamming distance, use:
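    function check ($terms1, $terms2) {
        // Avoid dividing by zero when either document has no tokens.
        if (count($terms1) == 0 || count($terms2) == 0) return 0;
        // Term => number of occurrences in the first document.
        $counts1 = array_count_values($terms1);
        $totalScore = 0;
        foreach ($terms2 as $term) {
            // Add 1 for each token of document 2 that also occurs in document 1.
            if (isset($counts1[$term])) $totalScore += 1;
        }
        // Scale by 500 and normalize by the product of the document lengths.
        return $totalScore * 500 / (count($terms1) * count($terms2));
    }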
(Note that you're only adding 1 for each matched element in the token vectors.)
And for cosine similarity, use:
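    function check ($terms1, $terms2) {
        $counts1 = array_count_values($terms1);
        $counts2 = array_count_values($terms2);
        $square = function ($c) { return $c * $c; };
        // Euclidean lengths of the two term-count vectors.
        $norm1 = sqrt(array_sum(array_map($square, $counts1)));
        $norm2 = sqrt(array_sum(array_map($square, $counts2)));
        if ($norm1 == 0 || $norm2 == 0) return 0;
        $totalScore = 0;
        // Loop over distinct terms so each shared term is counted exactly once.
        foreach ($counts2 as $term => $count2) {
            if (isset($counts1[$term])) $totalScore += $counts1[$term] * $count2;
        }
        // Cosine similarity: dot product divided by the product of the lengths.
        return $totalScore / ($norm1 * $norm2);
    }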
(Note that you're adding the product of the token counts between the two documents.)
The main difference between the two is that cosine similarity gives a stronger signal when the same word occurs several times in both documents, while Hamming distance doesn't care how often the individual tokens come up.
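To make that difference concrete, here is a minimal sketch that compares the two raw scores on a pair of toy documents. The function names `matchCount` and `countDotProduct` and the sample documents are my own, purely for illustration, and normalization is deliberately left out so that only the scoring step is visible.

```php
<?php
// Hypothetical helper names; raw scores only, no normalization applied.

// Presence-based score: +1 for every token of $terms2 that occurs in $terms1.
function matchCount(array $terms1, array $terms2): int {
    $counts1 = array_count_values($terms1);
    $score = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) {
            $score += 1;
        }
    }
    return $score;
}

// Frequency-based score: dot product of the two term-count vectors.
function countDotProduct(array $terms1, array $terms2): int {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $score = 0;
    foreach ($counts2 as $term => $count2) {
        if (isset($counts1[$term])) {
            $score += $counts1[$term] * $count2;
        }
    }
    return $score;
}

$docA = ['star', 'star', 'star', 'planet'];
$docB = ['star', 'star', 'moon'];

echo matchCount($docA, $docB), "\n";      // 2
echo countDotProduct($docA, $docB), "\n"; // 6
```

The repeated "star" lifts the frequency-based score to 3 * 2 = 6, while the presence-based score only sees two matching tokens.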
Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity: function words are quite frequent (in English, at least), so you might skew the result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.
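As a rough sketch of the filtering step, something like the following could be run on each token array before scoring. The helper name `removeFunctionWords` is hypothetical, and the default stop-word list is only a tiny illustrative sample, not a complete list of English function words. Lowercasing the tokens is an extra assumption on my part.

```php
<?php
// Hypothetical helper: drops common function words before scoring.
// The default list is a small illustrative sample, not a complete one.
function removeFunctionWords(
    array $terms,
    array $stopWords = ['the', 'a', 'an', 'of', 'and', 'to', 'in', 'is']
): array {
    // Flip the list so each lookup is a cheap isset() check on a key.
    $stopLookup = array_flip($stopWords);
    $kept = [];
    foreach ($terms as $term) {
        // Assumption: tokens are compared case-insensitively.
        $normalized = strtolower($term);
        if (!isset($stopLookup[$normalized])) {
            $kept[] = $normalized;
        }
    }
    return $kept;
}

$doc = ['The', 'galaxy', 'is', 'a', 'spiral', 'galaxy'];
print_r(removeFunctionWords($doc)); // keeps: galaxy, spiral, galaxy
```

You would then pass the filtered arrays to whichever scoring function you choose.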
Whichever way you go, good luck!