Why is `str_extract` only catching some of these values?

Question

I have a table that has a "membership type" column that includes a zillion different membership levels that we've used over the years.

example <-data.frame(membership = c( "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N", 
                              "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
                              "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ", 
                              "Individual (2 yr)",
                              "Individual Producer (Yearly)",
                              "Student Membership (Yearly)"  ))

I would expect that I could add a second column, with at least a rough set of possible values for the membership term with str_extract:

library(stringr)
example$term <-  example$membership %>% 
  str_extract(c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr"))

But that's only catching half the values and I can't find a pattern in what it is skipping.

1   Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
2   Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N  Period Paid: 2
3   Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G  NA
4   Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N  NA
5   Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6   Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N  NA
7   Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N  NA
8   Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N  NA
9   Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G  NA
10  Individual (2 yr)                                                   NA
11  Individual Producer (Yearly)                                        Yearly
12  Student Membership (Yearly)                                         NA

The only difference between row 4 and row 5 is the Payment ID. Why is it only finding the search value in Row 5?

And how do I fix it? But mostly, why?

Answer 1:

We can collapse the pattern vector into a single regular expression, joining the alternatives with |

library(stringr)
library(dplyr)
pattern_vec <- c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr")
example%>% 
      mutate(term = str_extract(membership,
      str_c(pattern_vec, collapse="|")))
#                                                       membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           2 yr
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)         Yearly
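To make the collapsed pattern visible, here is a minimal sketch. The names combined and demo are only illustrative, and demo holds shortened strings rather than the OP's full rows:

```r
library(stringr)

pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# Join the alternatives into one regular expression; "|" means "or"
combined <- str_c(pattern_vec, collapse = "|")
combined
# "Period Paid: 1|Period Paid: 2|Yearly|2 yr"

# A pattern of length 1 is tried against every element of the input
demo <- c("Period Paid: 1 Flag: G", "Individual (2 yr)", "Student Membership (Yearly)")
str_extract(demo, combined)
# "Period Paid: 1" "2 yr" "Yearly"
```

Because the result is a single string, there is nothing left to pair up row by row, so every row is checked against all four alternatives.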

str_extract is vectorized over both 'string' and 'pattern'. When 'pattern' is a vector of length > 1, matching is elementwise: the 1st value of 'membership' is checked against the 1st pattern, the 2nd against the 2nd, and so on. In the OP's case the lengths differ (the column is longer than the pattern vector), so the pattern vector is recycled: it starts over from its first element after row 4. A row gets a value only when the one pattern it happens to be paired with occurs in it, which is why, in the 8-row example, rows 1, 2, 5 and 7 match and the rest return NA.

To confirm the recycling, replicate pattern_vec explicitly with rep and compare the results:

out1 <- example %>% 
      mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))

out2 <- example %>% 
            mutate(term = str_extract(membership,  pattern_vec))
identical(out1, out2)
#[1] TRUE



out1
#                                                           membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G           <NA>
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N           <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           <NA>
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)           <NA>
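A small sketch that puts the pattern each row was actually paired with next to the result may make this easier to see. The membership strings below are shortened, illustrative versions of the eight example rows, and paired and pairing are hypothetical names. The patterns are repeated explicitly with rep, so the lengths match. As far as I know, recent stringr releases (1.5.0 and later) follow tidyverse recycling rules and raise an error for a length-4 pattern against a length-8 string instead of recycling silently, so the explicit form is also the safer one.

```r
library(stringr)

# Shortened, illustrative versions of the eight example rows
membership <- c(
  "ID #3564, Period Paid: 1 Flag: N",
  "ID #3611, Period Paid: 2 Flag: N",
  "ID #4105, Period Paid: 1 Flag: G",
  "ID #4136, Period Paid: 1 Flag: N",
  "ID #5191, Period Paid: 1 Flag: N",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
)
pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# Make the recycling explicit: patterns 1-4 are reused for rows 5-8
paired <- rep(pattern_vec, length.out = length(membership))

# One row per input string: the pattern it was paired with and what was extracted
pairing <- data.frame(
  pattern_tried = paired,
  term = str_extract(membership, paired),
  stringsAsFactors = FALSE
)
pairing
```

Reading pattern_tried against the original strings shows that, for example, row 3 contains "Period Paid: 1" but was only ever tested against "Yearly".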

Note from OP:

A post on RStudio Community that helped me (OP) understand the explanation above:

When fed with a single pattern, str_replace_all will compare that pattern against every element. However, if you pass it a vector, it will try to respect the order, so it compares the first pattern with the first object, then the second pattern with the second object.

Answer 2:

You can use a more complex regex, using lookbehind and lookahead:

example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")

Output:

example
                                                           membership           term
1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6                                                   Individual (2 yr)           2 yr
7                                        Individual Producer (Yearly)         Yearly
8                                         Student Membership (Yearly)         Yearly
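For readers less used to lookarounds, here is a rough breakdown of that regex as a sketch. The name term_regex and the test strings are made up for illustration:

```r
library(stringr)

# Alternative 1: the literal text "Period Paid: " followed by one or more digits
# Alternative 2: whatever sits between "(" and ")", excluding the parentheses
#   (?<=\\()  lookbehind: the match must be preceded by "("
#   .*        any run of characters (greedy)
#   (?=\\))   lookahead: the match must be followed by ")"
term_regex <- "Period Paid: \\d+|(?<=\\().*(?=\\))"

str_extract("Individual Producer (Yearly)", term_regex)
# "Yearly"
str_extract("Payment Record #0, Period Paid: 12 Flag: N", term_regex)
# "Period Paid: 12"
str_extract("No term here", term_regex)
# NA
```

One thing to keep in mind: .* is greedy, so a value with more than one parenthesized part would be captured from the first "(" to the last ")". If that can happen in the real data, [^)]* in place of .* is one possible adjustment.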

Source: https://stackoverflow.com/questions/61262340/why-is-str-extract-only-catching-some-of-these-values

Tags

tidyverse

stringr