I'm trying to figure out an efficient way to go about splitting a string like
"111110000011110000111000"
into a vector
[1] \
It's not really what the OP was looking for (concise R code), but I thought I'd give it a try in Rcpp. It turned out to be relatively simple and about 5x faster than the fastest R-based answers.
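library(Rcpp)
cppFunction(
'std::vector<std::string> split_str_cpp(std::string x) {
  std::vector<std::string> parts;
  std::size_t n = x.length();
  std::size_t start = 0;  // index where the current run begins
  for (std::size_t i = 1; i <= n; i++) {
    // close the current run at the end of the string or when the character changes
    if (i == n || x[i] != x[i-1]) {
      parts.push_back(x.substr(start, i - start));
      start = i;
    }
  }
  return parts;
}')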
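If you want to try the same loop outside R, here is a minimal standalone sketch in plain C++ (no Rcpp). The name split_runs is my own and not part of any library. It compares byte by byte, so it assumes single-byte characters; a multi-byte UTF-8 character would be cut into separate pieces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp or any library): split a string
// into runs of identical consecutive characters, comparing byte by byte.
std::vector<std::string> split_runs(const std::string& x) {
    std::vector<std::string> parts;
    const std::size_t n = x.size();
    std::size_t start = 0;  // index where the current run begins
    for (std::size_t i = 1; i <= n; ++i) {
        // A run ends at the end of the string or where the character changes.
        if (i == n || x[i] != x[i - 1]) {
            parts.push_back(x.substr(start, i - start));
            start = i;
        }
    }
    return parts;
}
```

For example, split_runs("aabccc") yields the three pieces "aa", "b" and "ccc", and an empty input yields an empty vector. The explicit i == n check closes the last run without reading past the final character.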
Testing split_str_cpp on these inputs
str1 <- "111110000011110000111000"
x1 <- "1111100000222000333300011110000111000"
x2 <- "aaaaabbcccccccbbbad1111100000222aaabbccd11DaaBB"
gives the following output:
> split_str_cpp(str1)
[1] "11111" "00000" "1111" "0000" "111" "000"
> split_str_cpp(x1)
[1] "11111" "00000" "222" "000" "3333" "000" "1111" "0000" "111" "000"
> split_str_cpp(x2)
[1] "aaaaa" "bb" "ccccccc" "bbb" "a" "d" "11111" "00000" "222" "aaa" "bb" "cc" "d" "11"
[15] "D" "aa" "BB"
A benchmark on a random string of one million characters shows it is roughly 5-9x faster (by median time) than the R solutions.
library(microbenchmark)
library(stringi)

akrun <- function(str1) strsplit(str1, '(?<=1)(?=0)|(?<=0)(?=1)', perl = TRUE)[[1]]

richard1 <- function(x3) {
  cs <- cumsum(
    rle(stri_split_boundaries(x3, type = "character")[[1L]])$lengths
  )
  stri_sub(x3, c(1, head(cs + 1, -1)), cs)
}

richard2 <- function(x3) {
  cs <- cumsum(rle(strsplit(x3, NULL)[[1L]])[[1L]])
  stri_sub(x3, c(1, head(cs + 1, -1)), cs)
}

set.seed(24)
x3 <- stri_rand_strings(1, 1e6)
microbenchmark(split_str_cpp(x3), akrun(x3), richard1(x3), richard2(x3), unit = 'relative', times = 20L)
Comparison:
Unit: relative
              expr      min       lq     mean   median       uq      max neval
 split_str_cpp(x3) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
         akrun(x3) 9.675613 8.952997 8.241750 8.689001 8.403634 4.423134    20
      richard1(x3) 5.355620 5.226103 5.483171 5.947053 5.982943 3.379446    20
      richard2(x3) 4.842398 4.756086 5.046077 5.389570 5.389193 3.669680    20