Jaccard similarity in stringdist package to match words in character string

Question

I would like to use the Jaccard similarity in the stringdist() function to determine the similarity of bags of words. From what I can tell, method='jaccard' only compares the letters (q-grams) within each character string.

library(stringdist)

c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')

stringdist(c, d, method='jaccard', q=2)
[1] 0 0 1

So it compares the two vectors element-wise: 'cat' with 'cat', 'dog' with 'dog', and 'person' with 'ufo'.

I also tried collapsing the words into one long text string. The following comes close to what I need, but it still works on characters, calculating 1 - (number of shared 2-grams / number of unique 2-grams across both strings):

f <- 'cat dog person'
g <- 'cat dog ufo'
stringdist(f, g, method='jaccard', q=2)
[1] 0.5625
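
To make the formula above concrete, the same number can be reproduced by hand in base R. This is only a minimal sketch: `qgrams` is a hypothetical helper written for illustration, not part of stringdist, and it assumes the distance is taken over sets of unique character 2-grams.

```r
# Collect the set of unique character q-grams in a single string.
qgrams <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

f <- 'cat dog person'
g <- 'cat dog ufo'
a <- qgrams(f)  # 13 unique 2-grams, spaces included
b <- qgrams(g)  # 10 unique 2-grams, spaces included

# Jaccard distance = 1 - |intersection| / |union| = 1 - 7/16
1 - length(intersect(a, b)) / length(union(a, b))
# [1] 0.5625
```

The 2-grams that span a space, such as 't ' and ' d', are counted as well, which is why the result does not reflect whole words.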

How would I get it to calculate similarity by the words?

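Answer 1:

You can start by tokenizing each sentence and hashing the resulting words, which turns every sentence into a sequence of integers, and then use seq_dist() to calculate the distance between those sequences. Because each integer stands for a whole word, q now counts words rather than characters: q = 2 compares pairs of adjacent words, while q = 1 compares individual words.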

library(hashr); library(stringdist)
f <- 'cat dog person'
g <- 'cat dog ufo'
seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)
[1] 0.6666667
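
If word order should not matter at all, the bag-of-words Jaccard distance can also be computed directly on sets of words. The following is a minimal base R sketch rather than anything provided by stringdist or hashr: `word_jaccard` is a hypothetical helper, and it assumes words are separated by whitespace and that each distinct word counts once.

```r
# Jaccard distance between the sets of distinct words in two strings.
word_jaccard <- function(x, y) {
  a <- unique(strsplit(x, "\\s+")[[1]])
  b <- unique(strsplit(y, "\\s+")[[1]])
  1 - length(intersect(a, b)) / length(union(a, b))
}

f <- 'cat dog person'
g <- 'cat dog ufo'
word_jaccard(f, g)
# [1] 0.5
```

Here two of the four distinct words ('cat' and 'dog') are shared, so the distance is 1 - 2/4.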

Source: https://stackoverflow.com/questions/37143944/jaccard-similarity-in-stringdist-package-to-match-words-in-character-string

Tags

text

stringdist
