R: Quanteda's textstat_simil function

Question

I am using Quanteda's textstat_simil to compute semantic relatedness in a text corpus. The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html

This is a running example and it works fine:

library(quanteda)
# compute term similarities
pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features"))
# as.matrix() takes only the similarity object; the row count belongs to head()
head(as.matrix(s1), 10)
as.list(s1, n = 8)

I have two questions.

First question: what weighting scheme has been applied to the dfm's frequencies before computing the cosine similarity? Normally, in distributional models like this one, similarity measures (e.g. cosine, dice, etc.) are computed on weighted frequencies, not on raw frequencies. Common weighting schemes are PPMI (Positive Pointwise Mutual Information), TF-IDF, etc. Which weighting scheme has been applied here? Is it possible to use another scheme, if needed?

Second question: where can I find more details about how textstat_simil options have been implemented in Quanteda? Namely, textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith". In particular, I would like to know how simple matching, edice and ejaccard are computed.

Thanks in advance for your answers.

Cheers, Marina

Answer 1:

1) Unless you weight the dfm first using dfm_weight(), the dfm that is input to textstat_simil() will be raw counts. (For cosine similarity, this produces the same result as relative term frequencies, since it is based on the angle between vectors rather than the distance between multi-dimensional coordinates.)
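The scale-invariance point can be checked with a small base R sketch. The two count vectors below are made-up toy documents (not taken from data_corpus_inaugural), and cosine() is a hypothetical helper written for this illustration, not a quanteda function. It compares document (row) vectors, where proportional weighting rescales each vector by a single constant.

```r
# Cosine similarity between two numeric vectors (base R only).
# Hypothetical helper for illustration; not part of quanteda.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Made-up raw counts for two documents over four features.
doc1 <- c(2, 0, 1, 3)
doc2 <- c(1, 1, 0, 4)

# Relative term frequencies: each document divided by its own total.
doc1_prop <- doc1 / sum(doc1)
doc2_prop <- doc2 / sum(doc2)

raw_sim  <- cosine(doc1, doc2)
prop_sim <- cosine(doc1_prop, doc2_prop)

# Rescaling a vector by a positive constant leaves the angle unchanged,
# so the two similarities agree up to floating-point tolerance.
print(all.equal(raw_sim, prop_sim))
```

To use a different scheme in practice, the dfm would be weighted before it is passed to textstat_simil(), for example with dfm_weight() as mentioned above or with quanteda's dfm_tfidf(). Schemes that rescale individual features, such as TF-IDF, generally do change the cosine values.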

2) The source code for the methods can be viewed here, where the formulas are presented in simple form in the comments to the specific functions.
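For orientation, here is a minimal base R sketch of the three measures asked about. It uses the common textbook definitions of extended Jaccard, extended Dice, and simple matching. These are an assumption and were not copied from the quanteda source, so they should be checked against the comments there. The function names and the toy vectors are hypothetical.

```r
# Made-up count vectors for two features (or documents).
x <- c(2, 0, 1, 3, 0)
y <- c(1, 1, 0, 4, 0)

# Extended (real-valued) Jaccard: xy / (xx + yy - xy).
# Assumed textbook form, not verified against quanteda's code.
ejaccard <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Extended (real-valued) Dice: 2xy / (xx + yy).
edice <- function(x, y) {
  2 * sum(x * y) / (sum(x^2) + sum(y^2))
}

# Simple matching on presence/absence: share of positions where both
# vectors are present or both are absent, i.e. (a + d) / n.
simple_matching <- function(x, y) {
  mean((x > 0) == (y > 0))
}

print(ejaccard(x, y))
print(edice(x, y))
print(simple_matching(x, y))
```

The extended variants work on the actual values, while simple matching first reduces the vectors to presence or absence, so weighting the dfm affects the first two but not the third.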

Source: https://stackoverflow.com/questions/50022984/r-quantedas-textstat-simil-function

Tags

quanteda