Find a word near another using stringr

Question

I have a simple problem. Consider this example:

library(dplyr)
library(stringr)
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well'))

# A tibble: 2 x 1
                                  mytext
                                   <chr>
1 stackoverflow is pretty good my friend
2       but sometimes pretty bad as well

I want to count the number of times 'stackoverflow' appears near 'good'. I use the following regex, but it returns 0 for every row:

dataframe %>%  mutate(mycount = str_count(mytext, 
 regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
                                  mytext mycount
                                   <chr>   <int>
1 stackoverflow is pretty good my friend       0
2       but sometimes pretty bad as well       0

Can someone tell me what I am missing here?

Thanks!

Answer 1:

I had a lot of trouble with this too, and I'm still not sure why my earlier attempts failed; I'm decent at regular expressions, but not an expert. I was able to get it to work with a lookbehind and a lookahead, though:

library(dplyr)
library(stringr)
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well',
                               'stackoverflow one two three four five six good',
                               'stackoverflow good'))

dataframe
dataframe %>%  mutate(mycount = str_count(mytext, 
      regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
## A tibble: 4 x 2
#                                          mytext mycount
#                                           <chr>   <int>
#1         stackoverflow is pretty good my friend       1
#2               but sometimes pretty bad as well       0
#3 stackoverflow one two three four five six good       0
#4                             stackoverflow good       1
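
As a possible explanation for why the pattern in the question returns 0: `\w+` never matches a space, so `(?:\w+){0,5}` cannot bridge the gap between two separate words. The sketch below compares the original pattern with a variant that allows whitespace before each intermediate word. The names `x`, `original`, and `spaced` and the `\b` word boundaries are illustrative assumptions, not part of the original post.

```r
library(stringr)

x <- c("stackoverflow is pretty good my friend",
       "stackoverflowgood")

# Pattern from the question: \w+ cannot consume the spaces between words,
# so it only matches when the two words are glued together.
original <- regex("stackoverflow(?:\\w+){0,5}good", ignore_case = TRUE)
original_counts <- str_count(x, original)

# Assumed variant: up to five intermediate words, each preceded by whitespace.
# \b stops "stackoverflow" and "good" from matching inside longer words.
spaced <- regex("\\bstackoverflow(?:\\s+\\w+){0,5}\\s+good\\b",
                ignore_case = TRUE)
spaced_counts <- str_count(x, spaced)

original_counts
spaced_counts
```

Unlike the lookaround version, this variant consumes both words, so overlapping occurrences are counted once per consumed match.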

Answer 2:

The corpus package makes this pretty easy:

library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
                                   'but sometimes pretty bad as well'))

# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")

# count the number of times 'good' is within 5 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
              | text_detect(text_sub(loc$after, 1, 4), "good"))

# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)

Conceptually, corpus treats text as a sequence of tokens. The package lets you index these sequences with the text_sub() function. You can also change the definition of a token using a text_filter().

Here's an example that works the same way but ignores punctuation-only tokens:

corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
                                "But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE

loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
              | text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
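
To illustrate the idea of treating text as a sequence of tokens without relying on corpus, here is a minimal base R sketch. `count_near` is a hypothetical helper, not a corpus function. Its tokenizer (lowercase, split on non-alphanumeric runs) and its window definition (absolute token distance of at most `window`) are simplifying assumptions, so its results may differ from the corpus code above.

```r
# Hypothetical helper (not part of corpus): for each text, count the
# occurrences of `target` that have `near` within `window` tokens on either side.
count_near <- function(text, target, near, window = 5) {
  vapply(text, function(s) {
    # Lowercase, then split on runs of non-alphanumeric characters,
    # which also discards punctuation-only tokens.
    tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    target_pos <- which(tokens == target)
    near_pos <- which(tokens == near)
    # For each target position, check whether any `near` token is close enough.
    hits <- vapply(target_pos,
                   function(p) any(abs(near_pos - p) <= window),
                   logical(1))
    sum(hits)
  }, integer(1), USE.NAMES = FALSE)
}

texts <- c("Stackoverflow, is pretty (?) GOOD my friend!",
           "But sometimes pretty bad as well",
           "stackoverflow one two three four five six good")
counts <- count_near(texts, "stackoverflow", "good")
counts
```

Working on token positions rather than raw characters makes the distance explicit and keeps punctuation from interfering, at the cost of a cruder tokenizer than the one corpus provides.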

Answer 3:

I think I got it:

dataframe %>%  
mutate(mycount = str_count(mytext, 
                 regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))

# A tibble: 4 x 2
                                  mytext mycount
                                   <chr>   <int>
1 stackoverflow is pretty good my friend       1
2       but sometimes pretty bad as well       0
3  stackoverflow good good stackoverflow       1
4                      stackoverflowgood       0

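The key was adding \W+, which matches one or more non-word characters (such as the space after stackoverflow), and following each intermediate word with a space. The original pattern had no way to match the whitespace between words.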
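
For a self-contained run, the sketch below rebuilds the four rows from the printed output above (the answer does not show how they were created, so this reconstruction is an assumption). It also tries an assumed variant, `loose_rx`, that uses `\W+` instead of a literal space after each intermediate word, so commas or repeated spaces between words do not break the match.

```r
library(dplyr)
library(stringr)

# Rows reconstructed from the printed output in this answer.
dataframe <- tibble(mytext = c("stackoverflow is pretty good my friend",
                               "but sometimes pretty bad as well",
                               "stackoverflow good good stackoverflow",
                               "stackoverflowgood"))

# Pattern from this answer: each intermediate word must end in one space.
answer_rx <- regex("stackoverflow\\W+(?:\\w+ ){0,5}good", ignore_case = TRUE)

# Assumed variant: any run of non-word characters may follow each word.
loose_rx <- regex("stackoverflow\\W+(?:\\w+\\W+){0,5}good", ignore_case = TRUE)

result <- dataframe %>%
  mutate(mycount = str_count(mytext, answer_rx),
         mycount_loose = str_count(mytext, loose_rx))
result

# The two patterns differ once punctuation sits between the words.
punct_strict <- str_count("Stackoverflow is, pretty good", answer_rx)
punct_loose <- str_count("Stackoverflow is, pretty good", loose_rx)
```

On the four reconstructed rows both patterns give the same counts; they only diverge on input such as the punctuated string at the end.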

Source: https://stackoverflow.com/questions/46934765/find-word-near-another-using-stringr

Tags

dplyr

stringr