Split string based on alternating character in R

醉话见心 2021-01-30 10:02

I'm trying to figure out an efficient way to split a string like

"111110000011110000111000"

into a vector

[1] "11111" "00000" "1111"  "0000"  "111"   "000"

9 Answers
  •  一整个雨季
    2021-01-30 10:40

    Original Approach: Here is a stringi approach that incorporates rle().

    x <- "111110000011110000111000"
    library(stringi)
    
    cs <- cumsum(
        rle(stri_split_boundaries(x, type = "character")[[1L]])$lengths
    )
    stri_sub(x, c(1L, head(cs + 1L, -1L)), cs)
    # [1] "11111" "00000" "1111"  "0000"  "111"   "000"  
    

    Or, you can use the length argument in stri_sub():

    rl <- rle(stri_split_boundaries(x, type = "character")[[1L]])
    with(rl, {
        stri_sub(x, c(1L, head(cumsum(lengths) + 1L, -1L)), length = lengths)
    })
    # [1] "11111" "00000" "1111"  "0000"  "111"   "000"  
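
    To make the index arithmetic easier to follow, here is a minimal sketch that prints each intermediate value for the example string. It uses base strsplit() instead of stri_split_boundaries(), on the assumption that the input is plain ASCII so both split it into the same single characters; the variable names (chars, runs, ends, starts) are only illustrative.

    ```r
    x <- "111110000011110000111000"

    # Split into single characters, then run-length encode them.
    chars <- strsplit(x, NULL)[[1L]]
    runs  <- rle(chars)
    runs$lengths
    # [1] 5 5 4 4 3 3

    # The cumulative sum of the run lengths gives the last position of each run.
    ends <- cumsum(runs$lengths)
    ends
    # [1]  5 10 14 18 21 24

    # Each run starts one past the previous run's end; the first starts at 1.
    starts <- c(1L, head(ends + 1L, -1L))
    starts
    # [1]  1  6 11 15 19 22

    # Cut the original string at those positions.
    pieces <- substring(x, starts, ends)
    ```

    The length-based variant above needs only starts and runs$lengths, since each end position is just start + length - 1.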
    

    Updated for Efficiency: After realizing that base::strsplit() is faster than stringi::stri_split_boundaries(), here is a more efficient version of my previous answer using only base functions.

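    library(stringi)  # used only to generate the random test string
    set.seed(24)
    x3 <- stri_rand_strings(1L, 1e6L)  # one random string of 1e6 characters
    
    system.time({
        # run lengths -> cumulative end position of each run
        cs <- cumsum(rle(strsplit(x3, NULL)[[1L]])$lengths)
        # each run starts one past the previous end; the first starts at 1
        substring(x3, c(1L, head(cs + 1L, -1L)), cs)
    })
    #   user  system elapsed 
    #  0.686   0.012   0.697 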
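
    For reuse, the base-only version can be wrapped in a small function. This is a sketch under a few assumptions: split_runs() is a hypothetical name, and it expects a single non-empty string; it has not been benchmarked here.

    ```r
    # split_runs() is a hypothetical helper name, not part of base R or stringi.
    # Assumes `x` is a single, non-empty character string.
    split_runs <- function(x) {
        stopifnot(is.character(x), length(x) == 1L, nzchar(x))
        # end position of each run of identical characters
        ends <- cumsum(rle(strsplit(x, NULL)[[1L]])$lengths)
        # start position of each run
        starts <- c(1L, head(ends + 1L, -1L))
        substring(x, starts, ends)
    }

    res_binary <- split_runs("111110000011110000111000")
    res_letters <- split_runs("aabbbc")
    ```

    With the string from the question, res_binary should hold the same six pieces as the stringi versions above, and nothing in the function is specific to 0/1 input, so res_letters is split the same way.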
    
