I am trying to figure out a way to calculate word proximities to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on Stack Overflow.
I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.
You can start by tokenizing the text into a one-row-per-word data frame, adding a position column, and removing stopwords:
library(tidytext)
library(dplyr)

all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
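(Here song is the text from the question, assumed to be a single character string holding the full lyrics; a placeholder along these lines, with the real text filled in, makes the example self-contained.)

# hypothetical placeholder for the text being analysed
song <- "Far over the misty mountains cold ..."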
You can then find just the word "fire" and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. You can then use group_by() and summarize() to get your desired statistics for each word.
library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))
Output in this case:
# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
1      fire      3                0                0              0.0
2     light      2               12                7              9.5
3      moon      2               13                9             11.0
4     bells      1               14               14             14.0
5   beneath      1               11               11             11.0
6    blazed      1               10               10             10.0
7    crowns      1                5                5              5.0
8      dale      1               15               15             15.0
9    dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows
Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word), so the summary is computed separately for each focus term.
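Put together, that multi-word version would look something like this (a sketch; "otherword" is just a stand-in for whichever second term you care about):

nearby_words_multi <- all_words %>%
  filter(word %in% c("fire", "otherword")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

nearby_words_multi %>%
  group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))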
The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function to count within a window is not kwic() but rather fcm() (feature co-occurrence matrix).
require(quanteda)
# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features   fire
##   Far         1
##   over        1
##   the         5
##   misty       1
##   mountains   0
##   cold        0
head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##          features
## features fire
##    light    2
Getting the average distance of a word from the target requires a bit of a hack of the weights argument. Below, the weights count each co-occurrence by its position in the window, so that summing the weighted counts and dividing by the total frequency within the window gives a weighted mean, i.e. the average distance. For your example of "light", for instance:
# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
  fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##          features
## features fire
##    light  9.5
Getting the minimum and maximum position is a bit more complicated. I can see a way to "hack" it using the weights to place a binary mask at each position and then converting that to a distance, but it's too ungainly to show, so I'm recommending the tidy solution unless I think of a more elegant way.
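If you did want minimum and maximum distances from the quanteda tokens anyway, one rough sketch is to skip fcm() entirely and work from the raw token positions in base R (object names here are just illustrative):

# flatten the tokens object to a single character vector, in document order
tok_vec  <- unlist(as.list(toks))
fire_pos <- which(tok_vec == "fire")    # positions of the focus term

# every (focus, token) pair within the 15-word window, with its distance
pair_df <- expand.grid(focus_position = fire_pos, position = seq_along(tok_vec))
pair_df$distance <- abs(pair_df$focus_position - pair_df$position)
pair_df <- subset(pair_df, distance > 0 & distance <= 15)
pair_df$word <- tok_vec[pair_df$position]

# minimum and maximum distance for each co-occurring word
aggregate(distance ~ word, data = pair_df,
          FUN = function(d) c(minimum = min(d), maximum = max(d)))

This should match the tidy results above, up to stopword removal, which isn't applied here.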