Question
I'm starting to do a lot of string matching in my work and I'm curious as to what the differences between the three functions are, and in what situations someone would use one over the other.
Answer 1:
stringr is "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi is its speed compared to base R. For the functions compared here, the output is the same in base R as in stringr.
I use stringi to generate some random text for demonstration:
library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)
grep returns the positions of the elements in the character vector that match a pattern, just as its equivalent str_which does:
grep("Lorem", sample_small)
#> [1] 1 9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1] 1 9 14 32 45 50 65 93 94
grepl/str_detect, on the other hand, return a logical value for each element of the vector, indicating whether it contains the pattern or not.
grepl("Lorem", sample_small)
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
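The relationship between the two kinds of output can be checked on a tiny vector. This is a minimal sketch with a made-up vector x (not part of the benchmark data above); it also shows one case where base R and stringr differ, namely missing values:

```r
library(stringr)

# Hypothetical toy input, including a missing value
x <- c("Lorem ipsum", "dolor sit", NA, "Lorem")

# The positions from grep() are the TRUE entries of grepl()
identical(grep("Lorem", x), which(grepl("Lorem", x)))
#> [1] TRUE

# Missing values: grepl() reports FALSE, str_detect() keeps NA
grepl("Lorem", x)
#> [1]  TRUE FALSE FALSE  TRUE
str_detect(x, "Lorem")
#> [1]  TRUE FALSE    NA  TRUE
```

So if the data may contain NA, the logical vectors from grepl and str_detect are not interchangeable without handling the missing entries first.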
There are many scenarios where the different output makes a difference. I usually use grepl when I want to add a new column to a data.frame that records whether another column contains a pattern. grepl makes this easier because its result has the same length as the input vector:
df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)
This way, some more elaborate tests are possible:
which(df$lorem & df$ipsum)
#> [1] 1 5 15 53 71 75
Or directly as a filter rule:
library(dplyr)  # provides %>% and filter()
# str_detect() takes the string first and the pattern second
df %>%
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))
Now, as for why to use stringr over base R, I think there are two arguments. First, the different argument order (string first, pattern second) makes stringr a little easier to use with pipes:
library(dplyr)
sample_small %>%
  str_detect("Lorem")
compared to:
sample_small %>%
  grepl("Lorem", .)
Second, stringr is roughly 5x faster than base R (for the two functions we are looking at):
sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
  base = grep("Lorem", sample_big),
  stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 674ms 674ms 1.48 415KB 0
#> 2 stringr 141ms 142ms 6.99 806KB 0
bench::mark(
  base = grepl("Lorem", sample_big),
  stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 679ms 679ms 1.47 391KB 0
#> 2 stringr 146ms 148ms 6.76 391KB 0
The difference is even more striking when we look for fixed (literal) strings instead of regular expressions, which are the default:
bench::mark(
  base = grepl("Lorem", sample_big, fixed = TRUE),
  stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 336ms 338.1ms 2.96 391KB 0
#> 2 stringr 12.4ms 12.6ms 79.1 417KB 0
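Fixed matching is not only faster, it can also return different results when the pattern contains regex metacharacters. A minimal sketch on a made-up two-element vector (the values are illustrative only):

```r
library(stringr)

# Hypothetical toy input
x <- c("a.b", "axb")

# As a regular expression, "." matches any character
grepl("a.b", x)
#> [1] TRUE TRUE
str_detect(x, "a.b")
#> [1] TRUE TRUE

# As a fixed string, "." only matches a literal dot
grepl("a.b", x, fixed = TRUE)
#> [1]  TRUE FALSE
str_detect(x, fixed("a.b"))
#> [1]  TRUE FALSE
```

For a plain word such as "Lorem" both modes find the same elements, so the choice there is purely about speed.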
However, I think the base functions have a certain charm to them, which is why I often still use them when writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels just a little awkward to me. Other examples are the option value = TRUE in grep (I'll let you figure that one out yourself) and finally ignore.case = TRUE, which again looks a little awkward in stringr:
str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#> [1] 1 5 6 8 9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97
However, this probably only feels awkward to me because I used base R for a while before learning stringr.
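For readers who would rather see the base options next to their stringr counterparts, here is a minimal sketch on a made-up three-element vector. Pairing value = TRUE with str_subset is my reading of the closest stringr equivalent, not something stated above:

```r
library(stringr)

# Hypothetical toy input
x <- c("Lorem ipsum", "dolor sit amet", "lorem again")

# value = TRUE returns the matching elements instead of their positions
grep("Lorem", x, value = TRUE)
#> [1] "Lorem ipsum"
str_subset(x, "Lorem")
#> [1] "Lorem ipsum"

# ignore.case = TRUE in base R vs. regex(ignore_case = TRUE) in stringr
grep("lorem", x, ignore.case = TRUE)
#> [1] 1 3
str_which(x, regex("lorem", ignore_case = TRUE))
#> [1] 1 3
```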
Another point to consider is that stringi offers even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and it might be a bit tougher to figure some things out.
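To give a feel for what calling stringi directly looks like, here is a minimal sketch of detection functions that roughly correspond to the calls used in this answer. The toy vector is made up, and the mapping to grepl/str_detect is approximate:

```r
library(stringi)

# Hypothetical toy input
x <- c("Lorem ipsum", "dolor sit amet", "lorem again")

# Regular-expression match, comparable to grepl() / str_detect()
stri_detect_regex(x, "Lorem")
#> [1]  TRUE FALSE FALSE

# Fixed-string match, comparable to fixed = TRUE / fixed()
stri_detect_fixed(x, "Lorem")
#> [1]  TRUE FALSE FALSE

# Case-insensitive regular-expression match
stri_detect_regex(x, "lorem",
                  opts_regex = stri_opts_regex(case_insensitive = TRUE))
#> [1]  TRUE FALSE  TRUE
```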
Source: https://stackoverflow.com/questions/57412700/whats-the-difference-between-the-str-detect-function-in-stringer-and-grepl-and