Question
I'm new here, and my question is mathematical rather than programming in nature; I would like a second opinion on whether my approach makes sense.
I was trying to find associations between words in my corpus using the function findAssocs from the tm package. Even though it appears to perform reasonably well on the data available through the package, such as New York Times and US Congress, I was disappointed with its performance on my own, less tidy dataset. It appears to be prone to being distorted by rare documents that contain several repetitions of the same words, which seems to create a strong association between them. I've found that the cosine measure gives a better picture of how related the terms are, even though, based on the literature, it tends to be used only to measure the similarity of documents rather than terms. Let's use the USCongress data from the RTextTools package to demonstrate what I mean:
First, I'll set everything up...
library(tm)          # Corpus, TermDocumentMatrix, findAssocs
library(RTextTools)  # USCongress data
library(data.table)  # %like%
library(stringr)     # str_count
library(lsa)         # cosine (assumed source of the cosine() used below)

data(USCongress)
text = as.character(USCongress$text)
corp = Corpus(VectorSource(text))
parameters = list(minDocFreq = 1,
                  wordLengths = c(2, Inf),
                  tolower = TRUE,
                  stripWhitespace = TRUE,
                  removeNumbers = TRUE,
                  removePunctuation = TRUE,
                  stemming = TRUE,
                  stopwords = TRUE,
                  tokenize = NULL,
                  weighting = function(x) weightSMART(x, spec = "ltn"))
tdm = TermDocumentMatrix(corp, control = parameters)
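As a side note, the SMART spec "ltn" means logarithmic term frequency (1 + log(tf)), idf document-frequency weighting, and no normalization (see ?weightSMART), so repeated terms are dampened but not capped. A quick sanity check on the resulting matrix (a minimal sketch, nothing beyond standard tm calls):
dim(tdm)   # terms x documents
inspect(tdm[c("foreign", "govern"), 1:5])   # stemming = TRUE, so "government" is stored as "govern"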
Let's say we are interested in investigating the relationship between "government" and "foreign":
# "government": appears in 37 docs, and across them it occurs 43 times
length(which(text %like% " government"))
sum(str_count(text, "government"))
# "foreign": appears in 49 docs, and across them it occurs 56 times
length(which(text %like% "foreign"))
sum(str_count(text, "foreign"))
# looking for docs that mention both "foreign" and "government":
length(which(text[which(text %like% "government")] %like% "foreign"))
# together they appear in only 3 docs
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)
findAssocs(tdm, "foreign", 0.1)
countri 0.34
lookthru 0.30
tuberculosi 0.26
carryforward 0.24
cor 0.24
malaria 0.23
hivaid 0.20
assist 0.19
coo 0.19
corrupt 0.19
# they do not appear to be associated
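(The scores findAssocs reports are the simple correlations between the term rows of the TDM, so they can be reproduced by hand; a minimal sketch on a dense copy of the matrix:)
m = as.matrix(tdm)
# Pearson correlation between the two stemmed term vectors across all documents
cor(m["foreign", ], m["govern", ])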
Now let's add another document that contains "foreign government" repeated 50 times:
# repeat the phrase 50 times (essentially the same as paste(rep("foreign government", 50), collapse = " "))
text[4450] = gsub("(.*)", paste(rep("\\1", 50), collapse = " "), "foreign government")
corp = Corpus(VectorSource(text))
tdm = TermDocumentMatrix(corp,control=parameters)
# running the association again:
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)
findAssocs(tdm, "foreign", 0.1)
govern 0.30
countri 0.29
lookthru 0.26
tuberculosi 0.22
cor 0.21
carryforward 0.20
malaria 0.19
hivaid 0.17
assist 0.16
coo 0.16
As you can see, it's now a different story, and it all comes down to that single document.
Here I would like to do something unconventional: use the cosine to find similarity between terms sitting in the document space. This measure tends to be used to find similarity between documents rather than terms, but I see no reason why it can't be used to find similarity between words. In the conventional sense, documents are the vectors and terms are the axes, and we can detect the similarity of documents based on the angle between them. But a term-document matrix is the transpose of a document-term matrix, so we can just as well project terms into the document space: let the documents be the axes and the terms the vectors between which the angle is measured. This doesn't seem to suffer from the same drawback as the simple correlation:
# note: with stemming = TRUE the term is stored as "govern", not "government"
cosine(as.vector(tdm["govern",]), as.vector(tdm["foreign",]))
[,1]
[1,] 0
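To spell out what that cosine is doing, here is a hand-rolled equivalent on a dense copy of the matrix (a minimal sketch; cos_sim is just a name used here for illustration):
m = as.matrix(tdm)
# cosine = dot product of the two term vectors over the product of their norms
cos_sim = function(v1, v2) sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
cos_sim(m["govern", ], m["foreign", ])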
Other than that, the two measures appear to be very similar:
tdm.reduced = removeSparseTerms(tdm,0.98)
# symmetric term-by-term cosine similarity matrix (terms are the rows of the TDM)
Proximity = function(tdm){
  d = nrow(tdm)
  r = matrix(0, d, d, dimnames = list(rownames(tdm), rownames(tdm)))
  for(i in 1:d){
    for(j in i:d){
      # cosine between the two term vectors in document space
      r[i, j] = cosine(as.vector(tdm[i, ]), as.vector(tdm[j, ]))
      r[j, i] = r[i, j]
    }
  }
  diag(r) = 0   # zero the self-similarities
  return(r)
}
rmat = Proximity(tdm.reduced)
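As an aside, the double loop above gets slow for larger vocabularies. Since the cosine of every term pair is just a cross-product of the row-normalized matrix, the same rmat can be computed in one vectorized step (a sketch, assuming tdm.reduced fits in memory as a dense matrix):
m = as.matrix(tdm.reduced)
m_norm = m / sqrt(rowSums(m^2))   # scale each term vector to unit length
rmat_fast = tcrossprod(m_norm)    # entry (i, j) = cosine of terms i and j
diag(rmat_fast) = 0               # match Proximity's convention of zeroed self-similarity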
# findAssocs method
head(as.data.frame(sort(findAssocs(tdm.reduced,"fund",0),decreasing=T)),n=10)
sort(findAssocs(tdm.reduced, "fund", 0), decreasing = T)
use 0.11
feder 0.10
insur 0.09
author 0.07
project 0.05
provid 0.05
fiscal 0.04
govern 0.04
secur 0.04
depart 0.03
# cosine method
head(as.data.frame(round(sort(rmat[,"fund"],decreasing=T),2)),n=10)
round(sort(rmat[, "fund"], decreasing = T), 2)
use 0.15
feder 0.14
bill 0.14
provid 0.13
author 0.12
insur 0.11
state 0.10
secur 0.09
purpos 0.09
amend 0.09
Surprisingly, though, I haven't seen cosine being used to detect similarities between terms, which makes me wonder if I've missed something important. Perhaps this method is flawed in a way I haven't thought of, so any thoughts on what I've done would be very much appreciated.
If you've made it this far, thanks for reading!!
Cheers
Answer 1:
If I understand your query (which I think really belongs on Stack Exchange), I believe the issue is that the term distances in findAssocs are based on Euclidean measurement. So a document that is simply double the words becomes an outlier and is considered much different in the distance measurement.
Switching to cosine as a measure for documents is widely used, so I suspect terms are OK too. I like the skmeans package for clustering documents by cosine: spherical k-means will accept a TDM directly and does cosine distance with unit length.
This video at ~11m in shows it, in case you don't already know it. Hope that was a bit helpful... in the end, I believe cosine is acceptable.
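For what it's worth, a minimal skmeans sketch along the lines of that suggestion (k = 5 and the use of tdm.reduced are arbitrary choices for illustration):
library(skmeans)
# skmeans clusters the rows, so transpose the TDM to cluster documents
dtm = t(as.matrix(tdm.reduced))
dtm = dtm[rowSums(dtm) > 0, ]   # drop docs left empty by the sparsity pruning
set.seed(1)
cl = skmeans(dtm, k = 5)        # spherical k-means: cosine on unit-length rows
table(cl$cluster)               # cluster sizes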
Source: https://stackoverflow.com/questions/21357656/tm-package-findassocs-vs-cosine