I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on
The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function for counting co-occurrences within a window is not kwic() but rather fcm() (feature co-occurrence matrix).
require(quanteda)
# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features    fire
##   Far          1
##   over         1
##   the          5
##   misty        1
##   mountains    0
##   cold         0
head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##          features
## features  fire
##   light      2
Getting the average distance of the words from the target requires a bit of a hack of the weights argument for distance. Below, the weights are applied so that each co-occurrence is counted according to its position within the window, which yields a weighted mean once these are summed and then divided by the total frequency within the window. For your example of "light", for instance:
# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
    fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##          features
## features  fire
##   light    9.5
Getting the minimum and maximum position is a bit more complicated. I can imagine a way to "hack" this using a combination of the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show in full, so I'm recommending the tidy solution unless I think of a more elegant way.
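For what it's worth, here is a rough sketch of that binary-mask idea, purely for illustration: it assumes fcm() will accept a weights vector of length window that is zero everywhere except at one position (I have not verified this), and it reuses the toks object from above.
# illustrative sketch only: one weighted fcm() per distance d, with a weight
# vector that is 1 at position d and 0 elsewhere, so each call counts the
# co-occurrences of "light" with "fire" at exactly that distance
dist_counts <- sapply(1:15, function(d) {
    w <- rep(0, 15)
    w[d] <- 1
    sum(fcm(toks, window = 15, context = "window", count = "weighted",
            weights = w)["light", "fire"])
})
which(dist_counts > 0)        # distances at which "light" co-occurs with "fire"
min(which(dist_counts > 0))   # minimum distance
max(which(dist_counts > 0))   # maximum distance
That requires one full fcm() pass per position, which is part of why I consider it too ungainly to recommend.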