What's the difference between the str_detect function in stringr and grepl and grep?

Submitted by 只愿长相守 on 2021-01-01 14:34:40

Question


I'm starting to do a lot of string matching in my work and I'm curious as to what the differences between the three functions are, and in what situations someone would use one over the other.


Answer 1:


stringr is "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi is its speed compared to base R. For the functions compared here, the output is the same in base as in stringr.

I use stringi to generate some random text for demonstration:

library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)

grep returns the indices of the elements of the character vector that match the pattern, just as its equivalent str_which does:

grep("Lorem", sample_small)
#> [1]  1  9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1]  1  9 14 32 45 50 65 93 94

grepl/str_detect, on the other hand, return a logical value for each element of the vector, telling you whether it contains the pattern or not:

grepl("Lorem", sample_small)
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
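
As a side note, one case where I know the two families do not agree is missing values. The following is a minimal sketch on a small made-up vector (x is hypothetical, not part of the sample above): grepl reports FALSE for an NA element, while str_detect keeps it as NA.

```r
library(stringr)

# Small hand-made vector (hypothetical data) that includes a missing value
x <- c("Lorem ipsum", "dolor sit", NA)

base_res    <- grepl("Lorem", x)       # TRUE FALSE FALSE
stringr_res <- str_detect(x, "Lorem")  # TRUE FALSE NA

# The non-missing elements give the same answer in both
identical(base_res[1:2], stringr_res[1:2])  # TRUE

# grep() and str_which() both leave the NA element out
grep("Lorem", x)       # 1
str_which(x, "Lorem")  # 1
```

This matters mostly when the logical vector is used for subsetting, because an NA index produces an NA row rather than dropping it.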

There are many scenarios where the different output could make a difference for you. I usually use grepl if I'm interested in adding a new column to a data.frame that records whether a different column contains a pattern. grepl makes this easier because its result has the same length as the input vector:

df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)

This way, some more elaborate tests are possible:

which(df$lorem & df$ipsum)
#> [1]  1  5 15 53 71 75

Or directly as a filter rule:

library(dplyr)  # provides filter()
# str_detect() takes the string first and the pattern second
df %>% 
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))

Now, in terms of why to use stringr over base, I think there are two arguments. First, the different syntax (the string comes first, then the pattern) makes it a little easier to use stringr with pipes:

library(dplyr)
sample_small %>% 
  str_detect("Lorem")

compared to:

sample_small %>% 
  grepl("Lorem", .) 

Second, stringr is roughly 5x faster than base in the benchmarks below (for the two functions we are looking at):

sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
  base = grep("Lorem", sample_big),
  stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          674ms    674ms      1.48     415KB        0
#> 2 stringr       141ms    142ms      6.99     806KB        0


bench::mark(
  base = grepl("Lorem", sample_big),
  stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          679ms    679ms      1.47     391KB        0
#> 2 stringr       146ms    148ms      6.76     391KB        0

The difference is even more striking when we look for fixed strings (the default is to treat the pattern as a regular expression):

bench::mark(
  base = grepl("Lorem", sample_big, fixed = TRUE),
  stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          336ms  338.1ms      2.96     391KB        0
#> 2 stringr      12.4ms   12.6ms     79.1      417KB        0
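
Apart from speed, fixed matching also changes what counts as a match. Here is a small sketch with made-up file names (files is hypothetical) showing how a "." in the pattern behaves as a regular expression versus as a fixed string:

```r
library(stringr)

# Hypothetical file names; "." is a regex metacharacter
files <- c("data.csv", "datacsv", "notes.txt")

# As a regular expression, "." matches any single character,
# so "datacsv" matches as well ("acsv")
regex_res <- grepl(".csv", files)                # TRUE TRUE FALSE

# As a fixed string, only a literal ".csv" matches
fixed_base    <- grepl(".csv", files, fixed = TRUE)  # TRUE FALSE FALSE
fixed_stringr <- str_detect(files, fixed(".csv"))    # TRUE FALSE FALSE
```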

However, I think the base functions have a certain charm to them, which is why I often still use them when writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels just a little awkward to me. Other examples would be the option value = TRUE in grep (I'll let you figure that one out yourself) and finally ignore.case = TRUE, which, again, looks a little awkward in stringr:

str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#>  [1]  1  5  6  8  9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97
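
For the value = TRUE option mentioned above, a minimal sketch on a made-up vector (words is hypothetical): it makes grep return the matching elements themselves instead of their indices, and str_subset is, as far as I know, the stringr counterpart.

```r
library(stringr)

# Hypothetical input vector
words <- c("Lorem ipsum", "dolor sit", "lorem amet")

# Matching elements instead of their positions
base_val    <- grep("Lorem", words, value = TRUE)  # "Lorem ipsum"
stringr_val <- str_subset(words, "Lorem")          # "Lorem ipsum"

# The same with case-insensitive matching
base_ci    <- grep("lorem", words, value = TRUE, ignore.case = TRUE)
stringr_ci <- str_subset(words, regex("lorem", ignore_case = TRUE))
# both return "Lorem ipsum" and "lorem amet"
```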

However, the reason this is awkward for me is probably just because I used base R for a while before learning stringr.

Another point to consider is that stringi offers even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and it might be a bit tougher to figure some things out.



Source: https://stackoverflow.com/questions/57412700/whats-the-difference-between-the-str-detect-function-in-stringer-and-grepl-and