How to calculate proximity of words to a specific term in a document

情深已故 2021-01-23 01:47

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on
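
Both answers below operate on an object called song, which in the original question held the document text as a single character string. A hypothetical stand-in of the same shape (the full text is not reproduced here; its opening words are visible in the second answer's output):

    # hypothetical stand-in: the document as one character string
    song <- "Far over the misty mountains cold ..."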

2 Answers
  • 2021-01-23 02:26

    I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.

    You can start by tokenizing the text into a one-row-per-word data frame, adding a position column, and removing stopwords:

    library(tidytext)
    library(dplyr)
    
    # `song` is the document as a single character string (see above)
    all_words <- tibble(text = song) %>%      # tibble() supersedes data_frame()
      unnest_tokens(word, text) %>%           # one row per word
      mutate(position = row_number()) %>%     # record each word's position
      filter(!word %in% tm::stopwords("en"))  # drop stopwords (requires tm)
    

    You can then filter for just the word "fire" and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of each occurrence. From there, group_by() and summarize() give your desired statistics for each word.

    library(fuzzyjoin)
    
    nearby_words <- all_words %>%
      filter(word == "fire") %>%    # occurrences of the focus term
      select(focus_term = word, focus_position = position) %>%
      difference_inner_join(all_words,
                            by = c(focus_position = "position"),
                            max_dist = 15) %>%    # within 15 positions
      mutate(distance = abs(focus_position - position))
    
    words_summarized <- nearby_words %>%
      group_by(word) %>%
      summarize(number = n(),
                maximum_distance = max(distance),
                minimum_distance = min(distance),
                average_distance = mean(distance)) %>%
      arrange(desc(number))
    

    Output in this case:

    # A tibble: 49 × 5
           word number maximum_distance minimum_distance average_distance
          <chr>  <int>            <dbl>            <dbl>            <dbl>
     1     fire      3                0                0              0.0
     2    light      2               12                7              9.5
     3     moon      2               13                9             11.0
     4    bells      1               14               14             14.0
     5  beneath      1               11               11             11.0
     6   blazed      1               10               10             10.0
     7   crowns      1                5                5              5.0
     8     dale      1               15               15             15.0
     9   dragon      1                1                1              1.0
    10 dragon’s      1                5                5              5.0
    # ... with 39 more rows
    

    Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word), as sketched below.
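
    A hedged sketch of that multi-term variant (the second focus word "gold" is only an illustrative choice; everything else follows the pipeline above):

    nearby_words <- all_words %>%
      filter(word %in% c("fire", "gold")) %>%   # several focus terms at once
      select(focus_term = word, focus_position = position) %>%
      difference_inner_join(all_words,
                            by = c(focus_position = "position"),
                            max_dist = 15) %>%
      mutate(distance = abs(focus_position - position))
    
    words_summarized <- nearby_words %>%
      group_by(focus_term, word) %>%            # one summary row per pair
      summarize(number = n(),
                average_distance = mean(distance)) %>%
      arrange(focus_term, desc(number))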

  • 2021-01-23 02:35

    The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function for counting co-occurrences within a window is not kwic() but rather fcm() (feature co-occurrence matrix).

    library(quanteda)
    
    # tokenize, removing punctuation and splitting hyphenated words
    # (remove_hyphens is called split_hyphens in newer quanteda releases)
    toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
    
    # all co-occurrences
    head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
    ## Feature co-occurrence matrix of: 155 by 1 feature.
    ## (showing first 6 documents and first feature)
    ##            features
    ## features    fire
    ##   Far          1
    ##   over         1
    ##   the          5
    ##   misty        1
    ##   mountains    0
    ##   cold         0
    
    head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
    ## Feature co-occurrence matrix of: 1 by 1 feature.
    ## 1 x 1 sparse Matrix of class "fcm"
    ##         features
    ## features fire
    ##    light    2
    

    Getting the average distance of each word from the target requires a bit of a hack using the weights argument. Below, each co-occurrence is weighted by its distance from the target, so summing the weighted counts and dividing by the total frequency within the window yields the mean distance. For your example of "light":

    # average distance
    fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
        fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
    ## 1 x 1 Matrix of class "dgeMatrix"
    ##         features
    ## features fire
    ##    light  9.5
    
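    Reassuringly, this 9.5 agrees with the average_distance the tidytext answer reports for "light": with weights = 1:15, each co-occurrence contributes its distance to the weighted count, so dividing by the plain frequency recovers the mean distance.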

    Getting the minimum and maximum distance is a bit more complicated. I can imagine hacking this by using the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show, so I'd recommend the tidy solution unless I think of a more elegant way. A direct positional computation is sketched below.
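
    For completeness, a minimal base-R sketch of that idea, computed directly from token positions rather than through fcm(). It assumes the toks object from above holds a single document, mirrors the 15-token window, and (like the tidytext output) includes the target itself at distance 0; unlike the tidytext pipeline, it counts every token, since stopwords are not removed:

    # not a quanteda feature: distances via plain positional arithmetic
    words <- as.character(toks[[1]])   # tokens of the (single) document
    pos   <- seq_along(words)
    fire_pos <- pos[words == "fire"]   # every occurrence of the target
    
    # pair each token within 15 positions of a "fire" with its distance
    pairs <- do.call(rbind, lapply(fire_pos, function(p) {
      near <- pos[abs(pos - p) <= 15]
      data.frame(word = words[near], distance = abs(near - p))
    }))
    
    # minimum and maximum distance per word
    min_d <- tapply(pairs$distance, pairs$word, min)
    max_d <- tapply(pairs$distance, pairs$word, max)
    head(data.frame(word = names(min_d), minimum = min_d, maximum = max_d))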
