Question
I do keyword-in-context analysis with quanteda for ngrams and tokens, and it works well. I now want to do it for skipgrams: capture the context of "barriers to entry" but also of "barriers to [...] [and] entry".
The following code returns a kwic object that is empty, but I don't know what I did wrong. doc.corpus refers to the text document. I also tried the tokenized version, but nothing changes.
The result is:
"kwic object with 0 rows"
x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2, skip = 0:4, concatenator = " ")
twic_skipgram <- kwic(doc.corpus, pattern = list(ntoken_test), window = 20, valuetype = "glob")
twic_skipgram
Answer 1:
Probably the easiest way is to use wildcards to represent the "skip".
library("quanteda")
## Package version: 2.1.1
txt <- c(
"There are barriers to entry.",
"Also barriers against entry.",
"Just barriers entry."
)
# for skip of 1
kwic(txt, phrase("barriers * entry"))
##
## [text1, 3:5] There are | barriers to entry | .
## [text2, 2:4] Also | barriers against entry | .
# for skip of 0 and 1
kwic(txt, phrase(c("barriers * entry", "barriers entry")))
##
## [text1, 3:5] There are | barriers to entry | .
## [text2, 2:4] Also | barriers against entry | .
## [text3, 2:3] Just | barriers entry | .
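The same wildcard idea can be extended to the skip range of 0 to 4 from the question by generating the patterns instead of typing them out. The following is a minimal sketch, not part of the original answer: `skip_patterns()` is a hypothetical helper written here for illustration, and the texts are passed through `tokens()` first on the assumption that this works across quanteda versions.

```r
library("quanteda")

txt <- c(
  "There are barriers to entry.",
  "Also barriers against entry.",
  "Just barriers entry."
)

# Hypothetical helper (not a quanteda function): builds one glob pattern per
# skip length, e.g. "barriers entry", "barriers * entry", "barriers * * entry".
skip_patterns <- function(first, last, max_skip) {
  vapply(0:max_skip, function(k) {
    paste(c(first, rep("*", k), last), collapse = " ")
  }, character(1))
}

pats <- skip_patterns("barriers", "entry", 4)

# phrase() makes kwic() treat each whitespace-separated element as its own token
res <- kwic(tokens(txt), pattern = phrase(pats), window = 20)
res
```

Note that each `*` matches any single token, punctuation included, so wider skips may pick up matches that span a comma or similar; removing punctuation at the `tokens()` step is one way to limit that.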
Source: https://stackoverflow.com/questions/63150922/keyword-in-context-kwic-for-skipgrams