Question
I want to find collocates of a word in text strings. A word's collocates are those words that co-occur with it either preceding or following it. Here's a made-up example:
GO <- c("This little sentence went on and on.",
"It was going on for quite a while.",
"In fact it has been going on for ages.",
"It still goes on.",
"It would go on even if it didn't.")
Let's say I'm interested in the words collocating with the lemma GO, including all the forms the verb 'go' can take, namely 'go', 'went', 'gone', 'goes', and 'going'. I want to extract the collocates on both the left and the right of GO using str_extract from the stringr package and assemble them in a data frame. This works fine as long as only single-word collocates are concerned. I can do it like this:
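library(stringr)

collocates <- data.frame(
  # one word before the node
  Left  = str_extract(GO, "\\w+\\b\\s(?=(go(es|ing|ne)?|went))"),
  # the node itself (any form of GO)
  Node  = str_extract(GO, "go(es|ing|ne)?|went"),
  # one word after the node
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)\\s\\w+\\b"))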
That's the result:
collocates
  Left      Node   Right
1 sentence  went   on
2 was       going  on
3 been      going  on
4 still     goes   on
5 would     go     on
But I'm interested not just in the one word before and after GO but, say, in up to three words before and after GO. Now using quantifier expressions gets me closer to the desired result but not quite there:
collocates <- data.frame(
Left = str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))"),
Node = str_extract(GO, "go(es|ing|ne)?|went"),
Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))
And that's the result now:
collocates
  Left                  Node   Right
1 This little sentence  went   on and on
2 It was                going
3 it has been           going
4 It still              goes
5 It would              go     on even if
While the collocates on the left side are all as desired, the collocates on the right side are partially missing. Why is that? And how can the code be changed to match all collocates correctly?
Expected output:
  Left                  Node   Right
1 This little sentence  went   on and on
2 It was                going  on for quite
3 it has been           going  on for ages
4 It still              goes   on
5 It would              go     on even if
Answer 1:
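The quantifier {0,3} (match between 0 and 3 repetitions of the preceding group) allows the group to match zero times, so an empty match is a valid result. The lookbehind is already satisfied right after the letters 'go' inside 'going' and 'goes'. No whitespace follows at that position, so the group matches zero times and str_extract returns an empty string instead of moving on to the end of the word.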
r <- data.frame(
Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))
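To make the zero-length match visible, here is a minimal sketch on a single sentence from the question. The variable names (x, lookbehind, zero_min, one_min) are chosen only for illustration and are not part of the original post.

```r
library(stringr)

x <- "It was going on for quite a while."
lookbehind <- "(?<=go(es|ing|ne)?|went)"

# {0,3}: the first position that satisfies the lookbehind is right after
# "go" inside "going". No whitespace follows there, so the group matches
# zero times and the overall match is the empty string.
zero_min <- str_extract(x, paste0(lookbehind, "(\\s\\w+\\b){0,3}"))

# {1,3}: an empty match is not allowed, so the search continues to the
# position after the whole word "going".
one_min <- str_extract(x, paste0(lookbehind, "(\\s\\w+\\b){1,3}"))

print(nchar(zero_min))  # number of characters in the {0,3} match
print(one_min)          # words captured with the {1,3} quantifier
```

Note that the {0,3} version returns an empty string rather than NA: the pattern did match, just with nothing in it.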
With a minimum of 1, an empty match is no longer allowed. The attempt right after 'go' inside 'going' or 'goes' fails, the regex engine moves on to the position after the complete word form, and up to three following words are captured. If no word follows the node at all, str_extract returns NA.
r <- data.frame(
Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
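Plugged back into the data frame from the question, the {1,3} quantifier could be used as sketched below. The str_trim calls are an addition here (they only strip the leading or trailing space that the patterns capture), and stringsAsFactors = FALSE is set so the columns stay character vectors on older R versions.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

collocates <- data.frame(
  # up to three words before the node (zero is fine on this side)
  Left  = str_trim(str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))")),
  # the node itself
  Node  = str_extract(GO, "go(es|ing|ne)?|went"),
  # one to three words after the node; NA if no word follows
  Right = str_trim(str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}")),
  stringsAsFactors = FALSE)

print(collocates)
```

On the five example sentences this should reproduce the expected output from the question.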
This can be further demonstrated by playing with the quantifier values and observing the following:
r <- data.frame(
Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){2,2}"))
print(r)
Right
1 on and
2 on for
3 on for
4 <NA>
5 on even
In the example above we chose {2,2} (minimum 2 and maximum 2); since the 4th row has only one word after the node, exactly 2 cannot be captured and we get <NA>.
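As an alternative to three separate patterns, the left context, the node, and the right context can be captured in a single pattern with str_match. This is only a sketch of a different technique, not the approach described in the answer above: it drops the lookarounds and uses word boundaries (\\b) so that the node must be a whole word, which also keeps {0,3} safe on the right side. The names pattern, m, and collocates2 are hypothetical.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

# group 1: up to three preceding words
# group 2: the node, anchored by word boundaries
# group 3: up to three following words
pattern <- "((?:\\w+\\s){0,3})\\b(go(?:es|ing|ne)?|went)\\b((?:\\s\\w+){0,3})"

# str_match returns a character matrix: column 1 is the full match,
# columns 2 to 4 hold the three capture groups.
m <- str_match(GO, pattern)

collocates2 <- data.frame(
  Left  = str_trim(m[, 2]),
  Node  = m[, 3],
  Right = str_trim(m[, 4]),
  stringsAsFactors = FALSE)

print(collocates2)
```

Like str_extract, str_match only reports the first occurrence of the node in each string, and a string without any form of GO yields a row of NA values.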
Source: https://stackoverflow.com/questions/59943119/regex-in-r-match-collocates-of-node-word