Question
I want to be able to find similarity between two strings, weighting each token (word) with its inverse document frequency (those frequencies are not taken from those strings).
Using quanteda I can create a dfm_tfidf with inverse document frequency weights, but I do not know how to proceed after that.
Sample data:
ss = c(
"ibm madrid limited research",
"madrid limited research",
"limited research",
"research"
)
counts = list(ibm = 1, madrid = 2, limited = 3, research = 4)
cor = corpus(long_list_of_strings)  ## the documents the word frequencies are taken from
df = dfm(cor, tolower = TRUE, verbose = TRUE)
dfi = dfm_tfidf(df)
The goal is to find a similarity function that can be called like:
res = similarity(dfi, "ibm limited", similarity_scheme = "simple matching")
with res in the form (random numbers for the example):
"ibm madrid limited research" 0.445
"madrid limited research" 0.2
"limited research" 0.76
"research" 0.45
Ideally, a function like the following would be applied to those frequencies:
sim = sum(Wc) / sqrt(sum(Wi) * sum(Wj))
where Wc are the weights of the words common to the two strings, and Wi and Wj are the weights of the words in string1 and string2 respectively.
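To make that concrete, here is a rough R sketch of the formula I have in mind, assuming each string has already been reduced to a named numeric vector of tf-idf weights (the helper name similarity_w is just illustrative, and I take the common-word weights Wc from the first string):
# w1 and w2: named numeric vectors of tf-idf weights, one entry per token
similarity_w <- function(w1, w2) {
  common <- intersect(names(w1), names(w2))  # words shared by both strings
  Wc <- w1[common]                           # weights of the common words
  sum(Wc) / sqrt(sum(w1) * sum(w2))
}
# e.g. similarity_w(c(ibm = 0.4, limited = 0.1), c(limited = 0.1, research = 0.1))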
Answer 1:
Here's a tidy solution for your problem. I use tidytext for the NLP steps and widyr to calculate the cosine similarity between the documents. Note that I turned your original ss vector into a tidy data frame with an ID column. You can make that column whatever you like, but it is what we will use at the end to report similarity.
library(tidytext)
library(widyr)
library(dplyr)
library(tibble)
# turn your original vector into a tibble with an ID column
ss <- tibble(value = c(
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research",
  "ee"
)) %>%
  rowid_to_column("ID")
# tokenize into one word per row and count words per document (tf-idf needs this)
ss_words <- ss %>%
  unnest_tokens(words, value) %>%
  count(ID, words, sort = TRUE)
# create tf-idf weights for your data (term = words, document = ID)
ss_tfidf <- ss_words %>%
  bind_tf_idf(words, ID, n)
# return list of document similarity
ss_tfidf %>%
  pairwise_similarity(ID, words, tf_idf, sort = TRUE)
The output for the above will be:
# A tibble: 12 x 3
# item1 item2 similarity
# <int> <int> <dbl>
# 1 3 2 0.640
# 2 2 3 0.640
# 3 4 3 0.6
# 4 3 4 0.6
# 5 2 1 0.545
# 6 1 2 0.545
# 7 4 2 0.384
# 8 2 4 0.384
# 9 3 1 0.349
#10 1 3 0.349
#11 4 1 0.210
#12 1 4 0.210
where item1 and item2 refer to the ID column we created earlier. There are some strange caveats with this answer. For example, notice that I added the ee token to your ss vector: pairwise_similarity failed when there was a document with only a single token. Strange behavior, but hopefully that gets you started.
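To score against a specific query string as in the question, one option (my own extension, not part of the original answer) is to append the query as one more document and filter the pairs down to its ID:
# append the query "ibm limited" as an extra document and rebuild
ss_q <- ss %>%
  add_row(ID = max(ss$ID) + 1, value = "ibm limited")
ss_q %>%
  unnest_tokens(words, value) %>%
  count(ID, words) %>%
  bind_tf_idf(words, ID, n) %>%
  pairwise_similarity(ID, words, tf_idf) %>%
  filter(item1 == max(ss_q$ID))  # keep only pairs involving the query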
Answer 2:
You want the textstat_simil() function from quanteda. You should add the target document to the corpus, and then use the selection argument to focus on it. "simple matching" is implemented as one of the similarity methods, but be aware that it looks only at the presence or absence of terms, so tf-idf weighting will not affect the result.
library("quanteda")
## Package version: 1.4.3
##
ss <- c(
"ibm limited",
"ibm madrid limited research",
"madrid limited research",
"limited research",
"research"
)
ssdfm <- dfm(ss)
ssdfm
## Document-feature matrix of: 5 documents, 4 features (40.0% sparse).
## 5 x 4 sparse Matrix of class "dfm"
## features
## docs ibm limited madrid research
## text1 1 1 0 0
## text2 1 1 1 1
## text3 0 1 1 1
## text4 0 1 0 1
## text5 0 0 0 1
dfm_tfidf(ssdfm)
## Document-feature matrix of: 5 documents, 4 features (40.0% sparse).
## 5 x 4 sparse Matrix of class "dfm"
## features
## docs ibm limited madrid research
## text1 0.39794 0.09691001 0 0
## text2 0.39794 0.09691001 0.39794 0.09691001
## text3 0 0.09691001 0.39794 0.09691001
## text4 0 0.09691001 0 0.09691001
## text5 0 0 0 0.09691001
Here, you can see that the result is unaffected by the tf-idf weighting:
dfm_tfidf(ssdfm) %>%
textstat_simil(method = "simple matching", selection = "text1") %>%
as.matrix()
## text1
## text1 1.00
## text2 0.50
## text3 0.25
## text4 0.50
## text5 0.25
ssdfm %>%
textstat_simil(method = "simple matching", selection = "text1") %>%
as.matrix()
## text1
## text1 1.00
## text2 0.50
## text3 0.25
## text4 0.50
## text5 0.25
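If you do want the weights to matter, a weight-sensitive method such as cosine is also available in textstat_simil(); a minimal sketch (output not shown here):
# cosine uses the actual cell values, so tf-idf weighting changes the result
dfm_tfidf(ssdfm) %>%
  textstat_simil(method = "cosine", selection = "text1") %>%
  as.matrix()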
Answer 3:
I had problems with the quanteda and qdap packages, so I built my own code to get a data frame of individual words and their frequency counts. The code could of course be improved, but I think it shows how to do it.
library(RecordLinkage)
library(stringr)
library(dplyr)
library(tm)  # provides removePunctuation() and stripWhitespace()
searchstring = c(
  "ibm madrid limited research",
  "madrid limited research",
  "limited research",
  "research"
)
cleanInput <- function(x) {
  x <- tolower(x)
  x <- removePunctuation(x)
  x <- stripWhitespace(x)
  x <- gsub("-", "", x)
  x <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)  # strip URLs
  x <- gsub("[[:digit:]]+", "", x)                    # strip digits
  x
}
searchstring <- cleanInput(searchstring)
# split into individual words and drop empty strings
splitted <- str_split(searchstring, " ", simplify = TRUE)
df <- data.frame(string = as.vector(splitted), stringsAsFactors = FALSE)
df <- df[df$string != "", , drop = FALSE]
# count how often each word occurs across all strings
result <- df %>%
  group_by(string) %>%
  summarise(n = n())
I first clean up the strings and then build a data.frame from them. Once I have the data.frame, I can use the jarowinkler function from the RecordLinkage package to measure the similarity between two strings. It is vectorized and fast :-)
> jarowinkler(result$string, "ibm limited")
[1] 0.0000000 0.8303030 0.8311688 0.3383838 0.0000000
I hope this is what you desired :-)
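Note that jarowinkler compares characters, not weighted tokens. If you also want the inverse-frequency weighting from the question, one possible bridge (my own sketch, treating each word's count in result as its document frequency) is:
# turn counts into inverse-frequency weights, log(N / df) style
result$weight <- log(length(searchstring) / result$n)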
Source: https://stackoverflow.com/questions/56365000/r-weighted-inverse-document-frequency-tfidf-similarity-between-strings