Question
I am playing around in R to find the tf-idf values. I have a set of documents like:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
I want to create a matrix like this:
Docs blue bright sky sun
D1 tf-idf 0.0000000 tf-idf 0.0000000
D2 0.0000000 tf-idf 0.0000000 tf-idf
D3 0.0000000 tf-idf tf-idf tf-idf
So, my code in R:
library(tm)
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs)) #Make a corpus object from a text vector
#Clean the text
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, tolower)
dd <- tm_map(dd, removePunctuation)
dd <- tm_map(dd, removeWords, stopwords("english"))
dd <- tm_map(dd, stemDocument)
dd <- tm_map(dd, removeNumbers)
inspect(dd)
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$D1
sky blue
$D2
sun bright
$D3
sun sky bright
> dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
> as.matrix(dtm)
Terms
Docs blue bright sky sun
D1 0.7924813 0.0000000 0.2924813 0.0000000
D2 0.0000000 0.2924813 0.0000000 0.2924813
D3 0.0000000 0.1949875 0.1949875 0.1949875
If I do a hand calculation then the matrix should be:
Docs blue bright sky sun
D1 0.2385 0.0000000 0.3521 0.0000000
D2 0.0000000 0.3521 0.0000000 0.3521
D3 0.0000000 0.1949875 0.058 0.058
For example, for blue I calculate tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255. Therefore tf-idf = tf*idf = 0.5*0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering why the hand-calculated matrix and the matrix from R differ. Which one is correct? Am I doing something wrong in the hand calculation, or is there something wrong in my R code?
Answer 1:
Your hand calculation doesn't agree with the DocumentTermMatrix calculation because you are using a different log base. When you say log(3/1) = 0.477121255 you must be using log base 10. In R, that would be log10(3). The default log in R is the natural log, so if you type log(3) in R you get ~1.10. But weightTfIdf uses log base 2 for its calculations. Thus when calculating tf-idf for "blue" you get
(1/2)*log2(3/1) = 0.7924813
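As a quick check, here is a minimal sketch in base R that computes the tf-idf for "blue" in D1 under each of the three log bases. The variable names (tf, n_docs, doc_freq) are just illustrative and are not part of the tm package.

```r
# tf-idf for "blue" in D1 under three different log bases.
tf <- 1 / 2        # "blue" is 1 of the 2 terms left in D1 after cleaning
n_docs <- 3        # number of documents in the corpus
doc_freq <- 1      # number of documents that contain "blue"

tfidf_log10 <- tf * log10(n_docs / doc_freq)  # base 10, as in the hand calculation
tfidf_ln    <- tf * log(n_docs / doc_freq)    # natural log, R's default for log()
tfidf_log2  <- tf * log2(n_docs / doc_freq)   # base 2, as used by weightTfIdf

print(round(c(log10 = tfidf_log10, ln = tfidf_ln, log2 = tfidf_log2), 7))
```

Only the base 2 value (0.7924813) matches the entry for blue in the matrix returned by as.matrix(dtm); the base 10 value (about 0.2386) is the hand-calculated one.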
I hope that clears things up.
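If you want to check the whole matrix by hand, the following is a rough sketch in base R only (no tm). It assumes the cleaned tokens are exactly those shown by inspect(dd) in the question, and that tf is the term count divided by the document length with idf = log2(N/df). The names tokens, terms, tf, doc_freq and tfidf are made up for this example.

```r
# Cleaned tokens, copied from the inspect(dd) output in the question.
tokens <- list(D1 = c("sky", "blue"),
               D2 = c("sun", "bright"),
               D3 = c("sun", "sky", "bright"))
terms <- sort(unique(unlist(tokens)))

# Term frequency: count of each term divided by the document length.
tf <- t(sapply(tokens, function(d) {
  tabulate(match(d, terms), nbins = length(terms)) / length(d)
}))
colnames(tf) <- terms

# Inverse document frequency with log base 2 (assumed to mirror weightTfIdf).
doc_freq <- colSums(tf > 0)
idf <- log2(length(tokens) / doc_freq)

# Multiply each column of tf by the idf of its term.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 7))
```

Under these assumptions the result should agree with the as.matrix(dtm) output in the question, e.g. 0.2924813 for sky in D1 and 0.1949875 for the three non-zero entries in D3.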
Source: https://stackoverflow.com/questions/24011395/different-tf-idf-values-in-r-and-hand-calculation