Question
I have calculated the tf/idf values of two documents. The following are the tf/idf values:
1.txt
0.0
0.5
2.txt
0.0
0.5
The documents look like:
1.txt => dog cat
2.txt => cat elephant
How can I use these values to calculate cosine similarity?
I know that I should calculate the dot product, then find the distance and divide the dot product by it. How can I calculate this using my values?
One more question: is it important that both documents have the same number of words?
Answer 1:
sim(a, b) = (a * b) / (|a| * |b|)
where a * b is the dot product and |a|, |b| are the vector norms (magnitudes). Some details:
import math

def dot(a, b):
    # dot product of two equal-length vectors
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def norm(a):
    # Euclidean (L2) norm of a vector
    s = 0
    for i in range(len(a)):
        s += a[i] * a[i]
    return math.sqrt(s)

def cossim(a, b):
    # cosine similarity: dot product divided by the product of the norms
    return dot(a, b) / (norm(a) * norm(b))
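Using the functions above on the asker's documents, and assuming (my assumption, not stated in the question) that the vectors are aligned over the shared vocabulary in the order [cat, dog, elephant], the similarity comes out to 0, because the only term the documents share, "cat", has a tf/idf weight of 0:

a = [0.0, 0.5, 0.0]   # 1.txt => dog cat
b = [0.0, 0.0, 0.5]   # 2.txt => cat elephant
print(cossim(a, b))   # 0.0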
Yes, to some extent a and b must have the same length (i.e. be built over the same vocabulary). But a and b usually have a sparse representation: you only need to store the non-zero entries, and then you can compute the norm and dot product faster.
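To illustrate that remark (a minimal sketch of my own, not from the original answer), here is the same computation over a sparse representation, with each document stored as a dict of its non-zero term weights:

import math

def sparse_dot(a, b):
    # only terms present in both vectors contribute to the dot product
    return sum(w * b[t] for t, w in a.items() if t in b)

def sparse_norm(a):
    # Euclidean norm over the stored (non-zero) entries
    return math.sqrt(sum(w * w for w in a.values()))

def sparse_cossim(a, b):
    return sparse_dot(a, b) / (sparse_norm(a) * sparse_norm(b))

# hypothetical weights: zero-weight terms ("cat") are simply not stored
doc1 = {"dog": 0.5}
doc2 = {"elephant": 0.5}
print(sparse_cossim(doc1, doc2))  # 0.0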
Answer 2:
A simple Java implementation (uses java.util.Map/Set and Guava's Sets):
static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // only terms that appear in both vectors contribute to the dot product
    Set<String> both = Sets.newHashSet(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
Answer 3:
1) Calculate tf-idf (generally better than tf alone, but it completely depends on your data set and requirements).
From Wikipedia (regarding idf):
An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
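As a concrete illustration of that factor (my own sketch, using the common idf = log(N / df) formulation; library implementations such as scikit-learn use smoothed variants):

import math

def idf(term, documents):
    # documents: list of token lists; df = number of documents containing the term
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)

docs = [["dog", "cat"], ["cat", "elephant"]]
print(idf("cat", docs))  # 0.0   -- "cat" occurs in every document
print(idf("dog", docs))  # ~0.69 -- "dog" occurs in only one document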
2) No, it is not important that both documents have the same number of words.
3) You can compute tf-idf or cosine similarity in pretty much any language nowadays by invoking a machine learning library function. I prefer Python.
Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# example dataset
from sklearn.datasets import fetch_20newsgroups
# replace with your method to get data
example_data = fetch_20newsgroups(subset='all').data

max_features_for_tfidf = 10000
is_idf = True

vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                             min_df=2, stop_words='english',
                             use_idf=is_idf)
X_Mat = vectorizer.fit_transform(example_data)

# calculate cosine similarity between samples in X with samples in Y
cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)
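The result is an (n_documents x n_documents) matrix of pairwise similarities; for example (my addition), the similarity between the first two documents is:

print(cosine_sim[0, 1])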
4) You might be interested in truncated Singular Value Decomposition (SVD).
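For instance (a minimal sketch of my own, assuming scikit-learn's TruncatedSVD; not part of the original answer), the tf-idf matrix from the snippet above can be reduced to a lower-dimensional latent space before computing similarities:

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# reduce the sparse tf-idf matrix X_Mat (from the snippet above) to 100 dimensions
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_Mat)

# cosine similarity in the reduced (LSA) space
cosine_sim_lsa = cosine_similarity(X_reduced, X_reduced)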
Source: https://stackoverflow.com/questions/1997750/cosine-similarity