Question
I have a simple problem; consider this example:
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
                                   'but sometimes pretty bad as well'))
# A tibble: 2 x 1
  mytext
  <chr>
1 stackoverflow is pretty good my friend
2 but sometimes pretty bad as well
I want to count the number of times stackoverflow appears near good. I use the following regex, but it does not work:
dataframe %>% mutate(mycount = str_count(mytext,
  regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
  mytext                                 mycount
  <chr>                                    <int>
1 stackoverflow is pretty good my friend       0
2 but sometimes pretty bad as well             0
Can someone tell me what I am missing here?
Thanks!
Answer 1:
I had a bunch of trouble with this too, and I'm still not sure why the things I was trying didn't work. I'm only decent at regular expressions, not an expert. However, I was able to get it to work with a lookbehind and a lookahead.
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
                                   'but sometimes pretty bad as well',
                                   'stackoverflow one two three four five six good',
                                   'stackoverflow good'))
dataframe
dataframe %>% mutate(mycount = str_count(mytext,
  regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
# A tibble: 4 x 2
#   mytext                                         mycount
#   <chr>                                            <int>
# 1 stackoverflow is pretty good my friend                1
# 2 but sometimes pretty bad as well                      0
# 3 stackoverflow one two three four five six good        0
# 4 stackoverflow good                                    1
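As for why the original pattern fails: (?:\\w+){0,5} contains no whitespace, and \w matches word characters only, never spaces, so the pattern can only match when the words are run together with nothing between them. A quick check (my own illustration, using base R rather than stringr):

```r
# \w never matches a space, so 'stackoverflow(?:\w+){0,5}good'
# only matches when the words are directly concatenated.
p <- "stackoverflow(?:\\w+){0,5}good"
grepl(p, "stackoverflow is pretty good", perl = TRUE)  # FALSE
grepl(p, "stackoverflowisprettygood", perl = TRUE)     # TRUE
```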
Answer 2:
The corpus library makes this pretty easy:
library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
                                   'but sometimes pretty bad as well'))
# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")
# count the number of times 'good' is within 4 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)
Conceptually, corpus treats text as a sequence of tokens. The library lets you index these sequences with the text_sub() function, and you can change the definition of a token using a text_filter().
Here's an example that works the same way but ignores punctuation-only tokens:
corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
                                "But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE
loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
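The same token-window idea can also be sketched without the corpus package. This is a minimal base-R illustration of the concept (the function name is my own, and the tokenizer is far cruder than corpus's):

```r
# Count occurrences of `target` that have `word` within `window`
# tokens on either side. Crude tokenizer: split on non-alphanumerics
# and lowercase everything.
near_count <- function(text, target = "stackoverflow",
                       word = "good", window = 5) {
  toks <- tolower(unlist(strsplit(text, "[^[:alnum:]]+")))
  toks <- toks[nzchar(toks)]
  hits <- which(toks == target)
  sum(vapply(hits, function(i) {
    lo <- max(1L, i - window)
    hi <- min(length(toks), i + window)
    word %in% toks[setdiff(lo:hi, i)]
  }, logical(1)))
}

near_count("Stackoverflow, is pretty (?) GOOD my friend!")  # 1
near_count("But sometimes pretty bad as well")               # 0
```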
Answer 3:
I think I got it:
dataframe %>%
  mutate(mycount = str_count(mytext,
    regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))
# A tibble: 4 x 2
  mytext                                 mycount
  <chr>                                    <int>
1 stackoverflow is pretty good my friend       1
2 but sometimes pretty bad as well             0
3 stackoverflow good good stackoverflow        1
4 stackoverflowgood                            0
The key was adding \W+, which matches one or more non-word characters (such as the spaces between the words).
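The same pattern also works outside stringr; here's a quick cross-check with base R's gregexpr (my own sketch, with illustrative variable names; (?i) is the inline PCRE equivalent of ignore_case = TRUE):

```r
texts <- c("stackoverflow is pretty good my friend",
           "but sometimes pretty bad as well",
           "stackoverflow good")
pattern <- "(?i)stackoverflow\\W+(?:\\w+ ){0,5}good"
# gregexpr returns match positions per string, or -1 when there is
# no match, so counting positions > 0 gives the match count.
counts <- vapply(gregexpr(pattern, texts, perl = TRUE),
                 function(m) sum(m > 0), integer(1))
counts  # 1 0 1
```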
Source: https://stackoverflow.com/questions/46934765/find-word-near-another-using-stringr