I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, Jaccard there only matches on letters, not on whole words.
You can start by tokenizing each sentence and hashing the resulting list of words, which turns your sentences into lists of integers, and then use seq_dist() to calculate the distance.
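library(hashr)
library(stringdist)
f <- 'cat dog person'
g <- 'cat dog ufo'
# Split on whitespace, hash each word to an integer, then compare the integer sequences.
# q = 2 compares pairs of adjacent words rather than single words.
seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)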
[1] 0.6666667
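Note that with q = 2 the comparison is over pairs of adjacent words, so word order affects the result. If you want a plain bag-of-words comparison, single words (q = 1) are probably closer to what you are after. As a cross-check, here is a minimal base-R sketch of the word-level Jaccard distance, computed directly from the sets of words. The helper name word_jaccard_dist is made up for illustration and is not part of stringdist or hashr; it assumes the words are separated by whitespace.

```r
# Hypothetical helper (not part of stringdist or hashr):
# Jaccard distance between the sets of words in two strings.
word_jaccard_dist <- function(a, b) {
  # Tokenize on whitespace and keep each distinct word once
  wa <- unique(strsplit(a, "\\s+")[[1]])
  wb <- unique(strsplit(b, "\\s+")[[1]])
  # 1 - |intersection| / |union|
  1 - length(intersect(wa, wb)) / length(union(wa, wb))
}

f <- 'cat dog person'
g <- 'cat dog ufo'
word_jaccard_dist(f, g)
# [1] 0.5
```

The two sentences share 2 of 4 distinct words, hence 1 - 2/4 = 0.5. I would expect the seq_dist() call above to give the same value with q = 1, but I have only worked that out by hand.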