Why is `str_extract` only catching some of these values?

血红的双手。 提交于 2020-07-07 09:56:29

问题


I have a table that has a "membership type" column that includes a zillion different membership levels that we've used over the years.

example <-data.frame(membership = c( "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N", 
                              "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
                              "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N", 
                              "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ", 
                              "Individual (2 yr)",
                              "Individual Producer (Yearly)",
                              "Student Membership (Yearly)"  ))

I would expect that I could add a second column, with at least a rough set of possible values for the membership term with str_extract:

library(stringr)
example$term <-  example$membership %>% 
  str_extract(c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr"))

But that's only catching half the values and I can't find a pattern in what it is skipping.

1   Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
2   Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N  Period Paid: 2
3   Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G  NA
4   Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N  NA
5   Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6   Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N  NA
7   Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N  NA
8   Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N  NA
9   Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G  NA
10  Individual (2 yr)                                                   NA
11  Individual Producer (Yearly)                                        Yearly
12  Student Membership (Yearly)                                         NA

The only difference between row 4 and row 5 is the Payment ID. Why is it only finding the search value in Row 5?

And how do I fix it. But mostly why?


回答1:


We can collapse it to a single string with |

library(stringr)
library(dplyr)
pattern_vec <- c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr")
example%>% 
      mutate(term = str_extract(membership,
      str_c(pattern_vec, collapse="|")))
#                                                       membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           2 yr
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)         Yearly

str_extract is vectorized for both the 'string' and 'pattern' except that if there is a vector of length > 1 in 'pattern', then it would be doing an elementwise match i.e. 1st value of 'membership' to 1st value of pattern, 2nd to 2nd and so on. Here, in the OP's case, the lengths are different i.e. column length is different than the pattern length. So, the pattern vector does a recycling by repeating itself from the start after row 4.

In order to check the recycling, you can use rep to replicate the pattern_vec and check the output:

out1 <- example %>% 
      mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))

out2 <- example %>% 
            mutate(term = str_extract(membership,  pattern_vec))
identical(out1, out2)
#[1] TRUE



out1
#                                                           membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G           <NA>
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N           <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           <NA>
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)           <NA>

Note from OP:

A post on RStudio Community that helped me (OP) understand the explanation above:

When fed with a single pattern, str_replace_all will compare that pattern for against every element. However, if you pass it a vector, it will try to respect the order, so compare the first pattern with the first object, then the second pattern with the second object.




回答2:


You can use a more complex regex, using lookbehind and lookahead:

example$term <-  example$membership %>% 
    str_extract("Period Paid: \\d+|(?<=\\().*(?=\\))")

Output:

example
                                                           membership           term
1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6                                                   Individual (2 yr)           2 yr
7                                        Individual Producer (Yearly)         Yearly
8                                         Student Membership (Yearly)         Yearly


来源:https://stackoverflow.com/questions/61262340/why-is-str-extract-only-catching-some-of-these-values

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!