R: weighted inverse document frequency (tfidf) similarity between strings

為{幸葍}努か submitted on 2020-08-09 08:17:12


I want to be able to find similarity between two strings, weighting each token (word) with its inverse document frequency (those frequencies are not taken from those strings).

Using quanteda I can create a dfm_tfidf with inverted frequency weights, but do not know how to proceed after that.

Sample data :

ss = c(
        "ibm madrid limited research", 
        "madrid limited research", 
        "limited research",
        "research"
)
counts = list(ibm = 1, madrid = 2, limited = 3, research = 4)
cor = corpus(long_list_of_strings)  ## the documents where we take words from
df = dfm(cor, tolower = T, verbose = T)
dfi = dfm_tfidf(df)

The goal is to find a function similarity that will make:

res = similarity(dfi, "ibm limited", similarity_scheme = "simple matching")

with res in the form (random numbers for the example):

"ibm madrid limited research"  0.445
"madrid limited research" 0.2
"limited research" 0.76
"research" 0.45

Ideally, I would apply a function like this to those weights:

sim = sum(Wc) / sqrt(sum(Wi)*sum(Wj)) 

where: Wc are the weights of words common to the two strings.
Wi and Wj are the weights of words in string1 and string2.
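
A minimal base R sketch of the formula above, assuming the IDF weights are available as a named numeric vector. The function name `weighted_sim`, the whitespace tokenizer, and the weight values are illustrative assumptions; they do not come from quanteda or from a real corpus.

```r
# Hypothetical IDF weights per token (made-up values for illustration).
idf <- c(ibm = 2.0, madrid = 1.5, limited = 0.5, research = 0.25)

# sim = sum(Wc) / sqrt(sum(Wi) * sum(Wj)), where Wc are the weights of the
# tokens shared by both strings and Wi, Wj the weights of each string's tokens.
weighted_sim <- function(s1, s2, weights) {
  t1 <- unique(strsplit(tolower(s1), "\\s+")[[1]])
  t2 <- unique(strsplit(tolower(s2), "\\s+")[[1]])
  # Tokens without a known weight are dropped (an assumption of this sketch).
  t1 <- t1[t1 %in% names(weights)]
  t2 <- t2[t2 %in% names(weights)]
  common <- intersect(t1, t2)
  denom <- sqrt(sum(weights[t1]) * sum(weights[t2]))
  if (denom == 0) return(0)
  sum(weights[common]) / denom
}

ss <- c("ibm madrid limited research", "madrid limited research",
        "limited research", "research")

# One similarity per string, named by the string itself.
res <- sapply(ss, weighted_sim, s2 = "ibm limited", weights = idf)
print(round(res, 3))
```

With these made-up weights, a string sharing the rare token "ibm" with the query scores much higher than one sharing only "limited", which is the behaviour the weighting is meant to produce.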


Here's a tidy solution for your problem.

I use tidytext for the NLP steps, and widyr to calculate the cosine similarity between the documents.

Note, I turned your original ss vector into a tidy dataframe with an ID column. You can make that column whatever, but it will be what we use at the end to show similarity.


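library(dplyr)
library(tibble)
library(tidytext)
library(widyr)

# turn your original vector into a tibble with an ID column
ss <- c(
  "ibm madrid limited research", 
  "madrid limited research", 
  "limited research",
  "research ee"
) %>% 
  as_tibble() %>% 
  rowid_to_column("ID")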

# create df of words & counts (tf-idf needs this)
ss_words <- ss %>% 
  unnest_tokens(words, value) %>% 
  count(ID, words, sort = TRUE)

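# create tf-idf embeddings for your data
# note: bind_tf_idf() takes (term, document, n); this call passes ID as the
# term and words as the document, and the output below appears to reflect
# that order. bind_tf_idf(words, ID, n) is the conventional per-document form.
ss_tfidf <- ss_words %>% 
  bind_tf_idf(ID, words, n)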

# return list of document similarity
ss_tfidf %>% 
  pairwise_similarity(ID, words, tf_idf, sort = TRUE)

The output for the above will be:

## A tibble: 12 x 3
#   item1 item2 similarity
#   <int> <int>      <dbl>
# 1     3     2      0.640
# 2     2     3      0.640
# 3     4     3      0.6  
# 4     3     4      0.6  
# 5     2     1      0.545
# 6     1     2      0.545
# 7     4     2      0.384
# 8     2     4      0.384
# 9     3     1      0.349
#10     1     3      0.349
#11     4     1      0.210
#12     1     4      0.210

where item1 and item2 refer to the ID column we created earlier.
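
For reference, cosine similarity itself is easy to compute by hand. The sketch below uses base R only and made-up weight vectors (the names `cosine_sim`, `a` and `b` are illustrative). It is not a description of how widyr implements `pairwise_similarity` internally, only the same formula applied to two vectors.

```r
# Cosine similarity: dot product divided by the product of the vector norms.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Made-up weights over the same four terms for two documents.
a <- c(ibm = 0.0, madrid = 0.5, limited = 0.3, research = 0.1)
b <- c(ibm = 0.0, madrid = 0.0, limited = 0.3, research = 0.1)

print(cosine_sim(a, b))
# Rescaling a vector does not change its direction, so the similarity
# of a vector with a multiple of itself stays at (numerically) 1.
print(cosine_sim(a, 2 * a))
```

Because the measure depends only on direction, multiplying all weights of one document by a constant leaves its similarities unchanged.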

There is one strange caveat with this answer. Notice that I added the ee token to your ss vector: pairwise_similarity failed when one of the documents contained only a single token. Strange behavior, but hopefully this gets you started.


You want the textstat_simil() function from quanteda. You should add the document to be targeted into the corpus, and then use the selection argument to focus on that. "simple matching" is implemented as one of the similarity methods, but you should be aware that this looks for the presence or absence of terms, so tf-idf weighting will not affect this.

library("quanteda")
## Package version: 1.4.3
ss <- c(
  "ibm limited",
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research"
)
ssdfm <- dfm(ss)
ssdfm
## Document-feature matrix of: 5 documents, 4 features (40.0% sparse).
## 5 x 4 sparse Matrix of class "dfm"
##        features
## docs    ibm limited madrid research
##   text1   1       1      0        0
##   text2   1       1      1        1
##   text3   0       1      1        1
##   text4   0       1      0        1
##   text5   0       0      0        1

dfm_tfidf(ssdfm)
## Document-feature matrix of: 5 documents, 4 features (40.0% sparse).
## 5 x 4 sparse Matrix of class "dfm"
##        features
## docs        ibm    limited  madrid   research
##   text1 0.39794 0.09691001 0       0         
##   text2 0.39794 0.09691001 0.39794 0.09691001
##   text3 0       0.09691001 0.39794 0.09691001
##   text4 0       0.09691001 0       0.09691001
##   text5 0       0          0       0.09691001

Here, you can see that the result is unaffected by the tf-idf weighting:

dfm_tfidf(ssdfm) %>%
  textstat_simil(method = "simple matching", selection = "text1")
##       text1
## text1  1.00
## text2  0.50
## text3  0.25
## text4  0.50
## text5  0.25

ssdfm %>%
  textstat_simil(method = "simple matching", selection = "text1")
##       text1
## text1  1.00
## text2  0.50
## text3  0.25
## text4  0.50
## text5  0.25
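
To see why the weights drop out, here is a small base R sketch of the simple matching coefficient applied to the text1 and text3 rows printed above. The helper name `simple_matching` is hypothetical, and this mirrors the textbook definition rather than quanteda's internal code.

```r
# Simple matching coefficient: share of features on which two documents
# agree about presence/absence (both present or both absent).
simple_matching <- function(x, y) {
  mean((x > 0) == (y > 0))
}

# Rows text1 and text3, copied from the count and tf-idf matrices above.
text1_count <- c(ibm = 1, limited = 1, madrid = 0, research = 0)
text3_count <- c(ibm = 0, limited = 1, madrid = 1, research = 1)
text1_tfidf <- c(ibm = 0.39794, limited = 0.09691001, madrid = 0, research = 0)
text3_tfidf <- c(ibm = 0, limited = 0.09691001, madrid = 0.39794,
                 research = 0.09691001)

print(simple_matching(text1_count, text3_count))
print(simple_matching(text1_tfidf, text3_tfidf))
```

Both calls give 0.25, the value in the text3 row above: only "limited" agrees out of four features, and the size of the weights never enters the calculation. Note that in this sketch a term whose tf-idf weight is exactly zero would be treated as absent.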


I had problems with the quanteda and qdap packages, so I built my own code to get a dataframe with individual words and frequency count. The code could be improved of course, but I think it shows how to do it.


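library(tm)             # removePunctuation(), stripWhitespace()
library(stringr)        # str_split()
library(dplyr)          # group_by(), summarise()
library(RecordLinkage)  # jarowinkler()

searchstring = c(
  "ibm madrid limited research", 
  "madrid limited research", 
  "limited research",
  "research"
)

cleanInput <- function(x) {
  x <- tolower(x)
  x <- removePunctuation(x)
  x <- stripWhitespace(x)
  x <- gsub("-", "", x)
  x <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  x  # return the cleaned strings
}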

searchstring <- cleanInput(searchstring)
splitted <- str_split(searchstring, " ", simplify = TRUE)
df <- as.data.frame(as.vector(splitted))
df <- df[df$`as.vector(splitted)` != "", , drop = FALSE]
colnames(df)[1] <- "string"
result <- df %>%
  group_by(string) %>%
  summarise(n = n())
result$string <- as.character(result$string)

I first clean up the strings and then build a data.frame with it.
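
As one possible simplification, the same word-count table can be built with base R alone. This is only a sketch under the assumption that splitting on whitespace is enough for the sample strings; it skips the punctuation, URL, and digit cleaning done by cleanInput.

```r
searchstring <- c("ibm madrid limited research",
                  "madrid limited research",
                  "limited research")

# Lowercase, split on whitespace, and drop any empty tokens.
tokens <- unlist(strsplit(tolower(searchstring), "\\s+"))
tokens <- tokens[tokens != ""]

# Tabulate the tokens into a data.frame with columns `string` and `n`.
result <- as.data.frame(table(string = tokens),
                        responseName = "n",
                        stringsAsFactors = FALSE)
print(result)
```

The resulting `result` has the same two columns as the dplyr version, so it can be passed to the next step unchanged.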

Once I have the data.frame, I can use the jarowinkler function from the RecordLinkage package to measure the similarity between two strings. It is vectorized and fast :-)

> jarowinkler(result$string, "ibm limited")
[1] 0.0000000 0.8303030 0.8311688 0.3383838 0.0000000

I hope this is what you desired :-)

