I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on
The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function for counting co-occurrences within a window is not kwic() but rather fcm() (feature co-occurrence matrix).
require(quanteda)
# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features    fire
##   Far          1
##   over         1
##   the          5
##   misty        1
##   mountains    0
##   cold         0
head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##          features
## features  fire
##   light      2
Getting the average distance of the words from the target requires a bit of a hack of the weights argument for distance. Below, the weights are applied so that each co-occurrence is counted according to its position within the window, which yields a weighted mean once these are summed and then divided by the total frequency within the window. For your example of "light", for instance:
# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
    fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##          features
## features  fire
##   light    9.5
Getting the minimum and maximum position is a bit more complicated. I can imagine a way to "hack" this using a combination of the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show in full, so I'm recommending the tidy solution unless I think of a more elegant way.
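For what it's worth, here is a rough sketch of that binary-mask idea, purely for illustration: it assumes fcm() will accept a weights vector of length window that is zero everywhere except at one position (I have not verified this), and it reuses the toks object from above.
# illustrative sketch only: one weighted fcm() per distance d, with a weight
# vector that is 1 at position d and 0 elsewhere, so each call counts the
# co-occurrences of "light" with "fire" at exactly that distance
dist_counts <- sapply(1:15, function(d) {
    w <- rep(0, 15)
    w[d] <- 1
    sum(fcm(toks, window = 15, context = "window", count = "weighted",
            weights = w)["light", "fire"])
})
which(dist_counts > 0)        # distances at which "light" co-occurs with "fire"
min(which(dist_counts > 0))   # minimum distance
max(which(dist_counts > 0))   # maximum distance
That requires one full fcm() pass per position, which is part of why I consider it too ungainly to recommend.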