I have two character vectors of length 6, and I would like to compute a similarity score between 0 and 1 for them (the cosine similarity).
a=c("HDa","2Pb","2","BxU","BuQ","Bve")
b=c("HCK","2Pb","2","09","F","G")
Can anyone explain what I should do?
Using the lsa package and the manual for this package:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: this is what the myMatrix object looks like:
myMatrix
# myMatrix
#       docs
# terms  D1 D2
#   2     1  1
#   2pb   1  1
#   buq   1  0
#   bve   1  0
#   bxu   1  0
#   hda   1  0
#   09    0  1
#   f     0  1
#   g     0  1
#   hck   0  1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
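As a cross-check, the same value can be reproduced in base R from the definition of cosine similarity: the dot product divided by the product of the Euclidean norms. This is only a sketch. The vectors d1 and d2 are typed in by hand from the two columns of the matrix printed above, and cosine_sim is a helper name introduced here for illustration, not part of lsa.

```r
# Columns D1 and D2 of the document-term matrix, copied by hand
# (term order: 2, 2pb, buq, bve, bxu, hda, 09, f, g, hck)
d1 <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
d2 <- c(1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

# Hypothetical helper: cosine similarity written out from its definition
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cosine_sim(d1, d2)
# 2 shared terms / (sqrt(6) * sqrt(6)) = 2/6, about 0.3333
```

Because both vectors only contain non-negative counts, the result cannot be negative, so it already lies between 0 and 1.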
You need a dictionary of possible terms first, and then convert your vectors to binary vectors with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors a2 and b2, you can calculate a similarity with cor(a2, b2). Note that cor() returns the Pearson correlation, which is the cosine similarity of the mean-centered vectors, so it lies between -1 and 1. You could map it to [0, 1] with something like this: 0.5*cor(a2, b2) + 0.5
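The dictionary idea can be sketched in base R without any packages. The names a2 and b2 follow the answer; dict, cos_ab and cor_ab are names chosen here for illustration. The sketch assumes that terms are compared case-sensitively and that the dictionary is simply the union of the two vectors.

```r
a <- c("HDa", "2Pb", "2", "BxU", "BuQ", "Bve")
b <- c("HCK", "2Pb", "2", "09", "F", "G")

# Dictionary: every distinct term that occurs in either vector
dict <- union(a, b)

# Binary indicator vectors: 1 where the dictionary term is present, 0 elsewhere
a2 <- as.integer(dict %in% a)
b2 <- as.integer(dict %in% b)

# Plain cosine similarity (no mean-centering); 2/6 for these vectors
cos_ab <- sum(a2 * b2) / sqrt(sum(a2^2) * sum(b2^2))

# Pearson correlation rescaled to [0, 1]; cor(a2, b2) is -2/3 here, so this is 1/6
cor_ab <- 0.5 * cor(a2, b2) + 0.5
```

The two scores differ because cor() subtracts each vector's mean before comparing, while the plain cosine does not. Which one is appropriate depends on whether terms absent from both vectors should count as agreement.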
library(tm)  # provides VCorpus, VectorSource, DocumentTermMatrix, weightTf
CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
# keep one-character words and weight by raw term frequency
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
# rows are documents, columns are terms
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
This approach could be the better one for larger data sets.
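With more than two documents, one option is to compute all pairwise similarities at once instead of comparing rows one pair at a time. The following is a minimal base R sketch, not part of tm or lsa. pairwise_cosine is a hypothetical helper, and the small matrix m is typed in by hand to stand in for as.matrix(dtm), with documents as rows and terms as columns.

```r
# Toy document-term matrix: rows are documents, columns are terms
# (counts typed in by hand for "Hi Hello" and "Hello")
m <- rbind(doc1 = c(hello = 1, hi = 1),
           doc2 = c(hello = 1, hi = 0))

# Hypothetical helper: cosine similarity between every pair of rows
pairwise_cosine <- function(m) {
  norms <- sqrt(rowSums(m^2))          # Euclidean norm of each document
  tcrossprod(m) / outer(norms, norms)  # dot products divided by norm products
}

sims <- pairwise_cosine(m)
sims["doc1", "doc2"]  # 1 / sqrt(2), about 0.7071
```

A document with no terms at all has norm 0 and would produce NaN here, so empty rows are best dropped beforehand.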
Source: https://stackoverflow.com/questions/34045738/how-can-i-calculate-cosine-similarity-between-two-strings-vectors