R: Quanteda's textstat_simil function

Posted by 倾然丶 夕夏残阳落幕 on 2020-03-21 06:41:05

Question


I am using Quanteda's textstat_simil to compute semantic relatedness in a text corpus. The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html

This is a running example and it works fine:

# compute term similarities
library(quanteda)
pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features"))
head(as.matrix(s1), 10)
as.list(s1, n = 8)

I have two questions.

First question: what weighting scheme has been applied to the dfm's frequencies before computing the cosine similarity? Normally, in distributional models like this one, similarity measures (e.g. cosine, dice, etc.) are computed on weighted frequencies, not on raw frequencies. Common weighting schemes are PPMI (Positive Pointwise Mutual Information), TF/IDF, etc. Which weighting scheme has been applied here? Is it possible to use another scheme, if needed?

Second question: where can I find more details about how textstat_simil options have been implemented in Quanteda? Namely, textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith". In particular, I would like to know how simple matching, edice and ejaccard are computed.

Thanks in advance for your answers.

Cheers, Marina


Answer 1:


1) Unless you weight the dfm first using dfm_weight(), the dfm that is input to textstat_simil() will contain raw counts; no weighting is applied for you. (For cosine similarity, raw counts produce the same result as relative term frequencies whenever each compared vector is simply rescaled by a positive constant, since cosine is based on the angle between vectors rather than the distance between multi-dimensional coordinates.)
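To make the point about angles concrete, here is a minimal base-R sketch. It does not use quanteda: the `cosine()` helper and the toy count matrix are assumptions made up for illustration. It checks that dividing each document row by its total leaves the document-to-document cosine unchanged. It also shows that the same row normalization can change the cosine between feature columns, which is the `margin = "features"` case in the question, so weighting the dfm first (for example with dfm_weight() or dfm_tfidf()) is not a no-op there.

```r
# Cosine similarity between two numeric vectors (illustrative helper, base R only).
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical toy counts: 2 documents (rows) x 4 features (columns).
counts <- matrix(c(4, 0, 2, 2,
                   1, 3, 0, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("fair", "health", "terror", "law")))

# Relative term frequencies: each document row divided by its own total.
props <- counts / rowSums(counts)

# Document-to-document cosine: each vector is rescaled by one constant,
# so the angle between them (and hence the cosine) is unchanged.
cos_raw  <- cosine(counts["doc1", ], counts["doc2", ])
cos_prop <- cosine(props["doc1", ], props["doc2", ])
same_for_documents <- isTRUE(all.equal(cos_raw, cos_prop))

# Feature-to-feature cosine: row normalization rescales the entries of a
# feature column by different amounts, so the cosine can change.
feat_raw  <- cosine(counts[, "fair"], counts[, "law"])
feat_prop <- cosine(props[, "fair"], props[, "law"])
differs_for_features <- abs(feat_raw - feat_prop) > 1e-6

print(c(documents_unchanged = same_for_documents,
        features_changed = differs_for_features))
```

In quanteda itself, the analogous step would be to pass an already weighted dfm to textstat_simil() in place of `pres_dfm`.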

2) The source code for the methods can be viewed here, where the formulas are presented in simple form in the comments to the specific functions.
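As a rough guide to the three measures asked about, the sketch below uses the common textbook definitions. These are assumed here rather than taken from the quanteda source, so they should be checked against the comments in the code before relying on them. The function names `simple_matching()`, `ejaccard()` and `edice()` are hypothetical helpers, not quanteda functions. Simple matching works on presence/absence, while the "extended" Jaccard and Dice variants work on the real-valued counts.

```r
# Simple matching coefficient on binarized vectors:
# (both present + both absent) / total number of positions.
simple_matching <- function(x, y) {
  xb <- x > 0
  yb <- y > 0
  sum(xb == yb) / length(xb)
}

# Extended (real-valued) Jaccard: sum(xy) / (sum(x^2) + sum(y^2) - sum(xy)).
ejaccard <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Extended (real-valued) Dice: 2 * sum(xy) / (sum(x^2) + sum(y^2)).
edice <- function(x, y) {
  2 * sum(x * y) / (sum(x^2) + sum(y^2))
}

# Hypothetical count vectors for two features across five documents.
x <- c(2, 0, 1, 3, 0)
y <- c(1, 1, 0, 3, 0)

smc <- simple_matching(x, y)  # 3 of 5 positions agree on presence/absence
ej  <- ejaccard(x, y)         # 11 / (14 + 11 - 11)
ed  <- edice(x, y)            # 22 / (14 + 11)
print(c(simple_matching = smc, ejaccard = ej, edice = ed))
```

On binary input these extended forms reduce to the ordinary Jaccard and Dice coefficients, which is one quick way to sanity-check any implementation against them.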



Source: https://stackoverflow.com/questions/50022984/r-quantedas-textstat-simil-function
