Split string based on alternating character in R

醉话见心 2021-01-30 10:02

I'm trying to figure out an efficient way to go about splitting a string like

"111110000011110000111000"

into a vector

[1] "11111" "00000" "1111"  "0000"  "111"   "000"

9 answers
  •  清酒与你
    2021-01-30 10:30

    It's not really what the OP was looking for (concise R code), but I thought I'd give it a try in Rcpp. It turned out to be relatively simple and, in the benchmark below, roughly 5 times faster than the fastest R-based answers.

    library(Rcpp)
    
    cppFunction(
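      'std::vector<std::string> split_str_cpp(std::string x) {
    
      std::vector<std::string> parts;
    
      int n = x.length();
      int start = 0;  // index where the current run begins
    
      for(int i = 1; i <= n; i++) {
          // a run ends at the end of the string or where the character changes
          if(i == n || x[i] != x[i-1]) {
            parts.push_back(x.substr(start, i-start));
            start = i;
          } 
      }
    
      return parts;
    
      }')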
    

    Testing it on these strings:

    str1 <- "111110000011110000111000"
    x1 <- "1111100000222000333300011110000111000"
    x2 <- "aaaaabbcccccccbbbad1111100000222aaabbccd11DaaBB"
    

    gives the following output:

    > split_str_cpp(str1)
    [1] "11111" "00000" "1111"  "0000"  "111"   "000"  
    > split_str_cpp(x1)
     [1] "11111" "00000" "222"   "000"   "3333"  "000"   "1111"  "0000"  "111"   "000"  
    > split_str_cpp(x2)
     [1] "aaaaa"   "bb"      "ccccccc" "bbb"     "a"       "d"       "11111"   "00000"   "222"     "aaa"     "bb"      "cc"      "d"       "11"     
    [15] "D"       "aa"      "BB"   
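
    To try the same run-splitting loop outside R, here is a minimal standalone C++ sketch. The function name `split_runs` is made up for this example and is not part of the answer above. It also assumes an explicit end-of-string check and `std::size_t` indices, which are choices of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Standalone version of the run-splitting loop. split_runs is a name
// chosen for this sketch; it is not part of the original answer.
std::vector<std::string> split_runs(const std::string& x) {
    std::vector<std::string> parts;
    const std::size_t n = x.size();
    std::size_t start = 0;  // index where the current run begins

    for (std::size_t i = 1; i <= n; ++i) {
        // A run ends at the end of the string or where the character changes.
        if (i == n || x[i] != x[i - 1]) {
            parts.push_back(x.substr(start, i - start));
            start = i;
        }
    }
    return parts;
}
```

    An empty input never enters the loop, so the result is an empty vector.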
    

    A benchmark on a random string of one million characters shows it is roughly 5 to 9 times faster (by median) than the R solutions.

    akrun <- function(str1) strsplit(str1, '(?<=1)(?=0)|(?<=0)(?=1)', perl=TRUE)[[1]]
    
    richard1 <- function(x3){
      cs <- cumsum(
        rle(stri_split_boundaries(x3, type = "character")[[1L]])$lengths
      )
      stri_sub(x3, c(1, head(cs + 1, -1)), cs)
    }
    
    richard2 <- function(x3) {
      cs <- cumsum(rle(strsplit(x3, NULL)[[1L]])[[1L]])
      stri_sub(x3, c(1, head(cs + 1, -1)), cs)
    }
    
    library(microbenchmark)
    library(stringi)
    
    set.seed(24)
    x3 <- stri_rand_strings(1, 1e6)
    
    microbenchmark(split_str_cpp(x3), akrun(x3), richard1(x3), richard2(x3), unit = 'relative', times=20L)
    

    Comparison:

    Unit: relative
                  expr      min       lq     mean   median       uq      max neval
     split_str_cpp(x3) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
             akrun(x3) 9.675613 8.952997 8.241750 8.689001 8.403634 4.423134    20
          richard1(x3) 5.355620 5.226103 5.483171 5.947053 5.982943 3.379446    20
          richard2(x3) 4.842398 4.756086 5.046077 5.389570 5.389193 3.669680    20
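
    For comparison, the two `richard` functions take a different route: they compute run lengths with `rle()`, turn them into end positions with `cumsum()`, and cut substrings at those positions. A rough C++ sketch of that idea follows. The helper names `run_lengths` and `split_by_run_lengths` are hypothetical, and this is not how stringi implements it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: lengths of consecutive runs, like R's rle()$lengths.
std::vector<std::size_t> run_lengths(const std::string& x) {
    std::vector<std::size_t> lengths;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (i > 0 && x[i] == x[i - 1]) {
            ++lengths.back();      // same character: extend the current run
        } else {
            lengths.push_back(1);  // new character: start a new run
        }
    }
    return lengths;
}

// Hypothetical helper: cut x at the cumulative run lengths.
std::vector<std::string> split_by_run_lengths(const std::string& x) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (std::size_t len : run_lengths(x)) {
        parts.push_back(x.substr(start, len));
        start += len;  // the running sum plays the role of cumsum()
    }
    return parts;
}
```

    This version walks the string once to count runs and then again to cut it, whereas the single loop in `split_str_cpp` does both in one pass.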
    
