How to calculate proximity of words to a specific term in a document

情深已故 2021-01-23 01:47

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on

2 answers
  •  北荒 (OP)
    2021-01-23 02:35

    The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function to count within a window is not kwic() but rather fcm() (feature co-occurrence matrix).

    require(quanteda)
    
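    # `song` is the text from the question
    # tokenize so that intra-word hyphens and punctuation are removed
    # (quanteda >= 2.0 names this argument split_hyphens instead of remove_hyphens)
    toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)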
    
    # all co-occurrences
    head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
    ## Feature co-occurrence matrix of: 155 by 1 feature.
    ## (showing first 6 documents and first feature)
    ##            features
    ## features    fire
    ##   Far          1
    ##   over         1
    ##   the          5
    ##   misty        1
    ##   mountains    0
    ##   cold         0
    
    head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
    ## Feature co-occurrence matrix of: 1 by 1 feature.
    ## 1 x 1 sparse Matrix of class "fcm"
    ##         features
    ## features fire
    ##    light    2
    

    Getting the average distance of a word from the target requires a bit of a hack of the weights argument. Below, weights = 1:15 weights each co-occurrence by its distance from the target (1 for an adjacent token, up to 15 at the edge of the window). Summing these weighted counts and dividing by the total frequency within the window gives the mean distance. For your example of "light", for instance:

    # average distance
    fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
        fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
    ## 1 x 1 Matrix of class "dgeMatrix"
    ##         features
    ## features fire
    ##    light  9.5
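
    The arithmetic behind that division can be checked by hand. The following is only a minimal base R sketch, not quanteda's implementation: it uses a hypothetical toy token vector (toks_toy, not the song) and assumes each (term, target) pair inside the window is counted once.

```r
# Hypothetical toy tokens, chosen so the result is easy to verify by hand
toks_toy <- c("the", "light", "of", "the", "fire", "and", "the", "light", "fire")
window <- 3

pos_term   <- which(toks_toy == "light")  # positions 2 and 8
pos_target <- which(toks_toy == "fire")   # positions 5 and 9

# absolute distance between every "light" and every "fire"
d <- abs(outer(pos_term, pos_target, "-"))

# keep only the pairs that fall inside the window
d_in <- d[d <= window]

# co-occurrence counts per distance 1..window
counts <- tabulate(d_in, nbins = window)

# "weighted" sum (count times distance) divided by the plain frequency
weighted_sum <- sum(counts * seq_len(window))
total        <- sum(counts)
avg_distance <- weighted_sum / total

avg_distance  # 7/3, about 2.33, identical to mean(d_in)
```

    Weighting each count by its distance and dividing by the unweighted count is just the mean of the individual distances, which is why the ratio of the two fcm() calls can be read as an average distance.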
    

    Getting the minimum and maximum positions is more complicated. I can figure out a way to "hack" this by using the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show, so I'm recommending the tidy solution unless I think of a more elegant way.
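
    For comparison, minimum, maximum and mean distances can also be computed directly from token positions. This is a rough base R sketch under stated assumptions, and it is neither the quanteda weights hack nor the tidytext answer: proximity_summary() is a hypothetical helper, the input is a plain character vector of tokens, and every (word, target) pair within the window counts once.

```r
# Hypothetical helper (not part of quanteda): summarise how far each word
# sits from a target term, looking only at pairs within `window` tokens.
proximity_summary <- function(words, target, window = 15) {
  pos_target <- which(words == target)
  others <- setdiff(unique(words), target)
  rows <- lapply(others, function(w) {
    # distances between every occurrence of w and every occurrence of target
    d <- abs(outer(which(words == w), pos_target, "-"))
    d <- d[d <= window]
    if (length(d) == 0) return(NULL)  # w never appears inside the window
    data.frame(word = w, n = length(d), min = min(d), max = max(d),
               mean = mean(d), stringsAsFactors = FALSE)
  })
  # drop words with no co-occurrence, then stack the one-row data frames
  do.call(rbind, Filter(Negate(is.null), rows))
}

# Toy tokens for illustration only
toks_toy <- c("the", "light", "of", "the", "fire", "and", "the", "light", "fire")
res <- proximity_summary(toks_toy, "fire", window = 3)
res
```

    The pairwise outer() step is quadratic in the number of occurrences, which is fine for a single song but may be slow for a large corpus.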
