Jaccard similarity in stringdist package to match words in character string

孤独总比滥情好 2021-02-06 17:21

I would like to use the Jaccard similarity in the stringdist() function to determine the similarity of bags of words. From what I can tell, using Jaccard only matches on letters within a character string, not on whole words.

1 answer
  •  逝去的感伤
    2021-02-06 17:47

    You can start by tokenizing the sentences and hashing the resulting lists of words, which turns each sentence into a sequence of integers. Then use seq_dist() to calculate the distance between those integer sequences.

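    library(hashr)       # hash(): turns each word into an integer
    library(stringdist)  # seq_dist(): distances between integer sequences
    f <- 'cat dog person'
    g <- 'cat dog ufo'
    # q = 2 compares pairs of adjacent words (word bigrams) instead of letters
    seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)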
    [1] 0.6666667
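
To see where 0.6666667 comes from, it may help to redo the calculation by hand. The sketch below uses base R only and is not how stringdist computes the distance internally; word_jaccard_dist is a hypothetical helper name, and returning 0 when neither sentence yields any q-gram is an assumption of this sketch. It treats runs of q consecutive words as set elements and returns 1 minus the ratio of shared to total elements.

```r
# Jaccard distance between two sentences, using runs of q consecutive
# words (word-level q-grams) as the set elements.
# Hypothetical helper for illustration; not part of stringdist or hashr.
word_jaccard_dist <- function(a, b, q = 1) {
  qgrams <- function(s) {
    w <- strsplit(s, "\\s+")[[1]]
    w <- w[nzchar(w)]                      # drop empty tokens from leading spaces
    if (length(w) < q) return(character(0))
    unique(vapply(seq_len(length(w) - q + 1),
                  function(i) paste(w[i:(i + q - 1)], collapse = " "),
                  character(1)))
  }
  x <- qgrams(a)
  y <- qgrams(b)
  u <- union(x, y)
  if (length(u) == 0) return(0)            # assumption: two empty sets count as identical
  1 - length(intersect(x, y)) / length(u)
}

f <- "cat dog person"
g <- "cat dog ufo"
word_jaccard_dist(f, g, q = 2)  # {cat dog, dog person} vs {cat dog, dog ufo}: 1 - 1/3
word_jaccard_dist(f, g, q = 1)  # {cat, dog, person} vs {cat, dog, ufo}: 1 - 2/4
```

With q = 2 the sentences share one of three distinct word pairs, which gives the 0.6666667 above. With q = 1 the comparison is on single words, which is closer to a plain bag of words; passing q = 1 to seq_dist() should correspond to that case.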
    
