Overlapping matches in R

不思量自难忘° 2020-12-01 18:22

I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.

I also found the following SO question speaking of findin

6 Answers
  • 2020-12-01 18:58

    It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA elements.

    x <- 'ACCACCACCAC'
    y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
    y[y != "CA"]
    # [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
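
    The same idea generalizes to windows of any fixed width. A minimal sketch, where the width k and the anchored filter pattern are illustrative choices rather than part of the answer above:

    ```r
    x <- "ACCACCACCAC"
    k <- 2                                   # window width (assumed for illustration)
    starts <- seq_len(nchar(x) - k + 1)      # every possible start position
    windows <- substring(x, starts, starts + k - 1)
    # keep only the windows that match the pattern in full
    hits <- windows[grepl("^[AC]C$", windows)]
    hits
    # [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    ```

    This only works when every match has the same length; variable-length matches need the regex-based approaches in the other answers.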
    
  • 2020-12-01 19:09

    The standard regmatches() does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a look-ahead (ignoring the capture), the match itself is zero-length. The replacement form, regmatches<-, illustrates this. Observe:

    x <- 'ACCACCACCAC'
    m <- gregexpr('(?=([AC]C))', x, perl=T)
    regmatches(x, m) <- "~"
    x
    # [1] "~A~CC~A~CC~A~CC~AC"
    

    Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
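
    The zero-length matches can also be seen by inspecting the attributes that gregexpr() attaches to its result. A short sketch using only base R; the variable names are illustrative:

    ```r
    x <- "ACCACCACCAC"
    m <- gregexpr("(?=([AC]C))", x, perl = TRUE)[[1]]
    pos <- as.integer(m)                  # start position of each match
    mlen <- attr(m, "match.length")       # length of the whole match (the look-ahead)
    clen <- attr(m, "capture.length")     # length of the captured group
    pos
    # [1]  1  2  4  5  7  8 10
    all(mlen == 0)   # TRUE: the look-ahead itself consumes nothing
    all(clen == 2)   # TRUE: each capture is two characters long
    ```

    The text we want lives in the capture.start and capture.length attributes, not in the match itself, which is why regmatches() returns empty strings here.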

    I've created a regcapturedmatches() function that I often use for such tasks. For example

    x <- 'ACCACCACCAC'
    regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
    
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    # [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    

    gregexpr() is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.

  • 2020-12-01 19:10

    A stringi solution using a capture group in the look-ahead part:

    > library(stringi)
    > stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
    ## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    
  • 2020-12-01 19:14

    As a workaround, this is what I came up with to extract the overlapping matches (it assumes every match is exactly two characters long).

    > x <- 'ACCACCACCAC'
    > m <- gregexpr('(?=([AC]C))', x, perl=T)
    > mapply(function(X) substr(x, X, X+1), m[[1]])
    [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    

    Please feel free to add or comment on a better way to perform this task.

  • 2020-12-01 19:14

    An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:

    > x <- 'ACCACCACCAC'
    > m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
    > start <- attr(m,"capture.start")
    > end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
    > sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
    [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    

    It's pretty ugly, which is why packages such as stringr exist.
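
    If you want to stay in base R, the steps above can be wrapped in a small helper. This is only a sketch: the name overlapping_matches is hypothetical, and it assumes the pattern is a look-ahead containing exactly one capture group.

    ```r
    # Hypothetical helper: returns a list with one character vector per input string.
    overlapping_matches <- function(x, pattern) {
      m <- gregexpr(pattern, x, perl = TRUE)
      lapply(seq_along(x), function(i) {
        mi <- m[[i]]
        if (mi[1] == -1) return(character(0))      # no match in this string
        start <- attr(mi, "capture.start")[, 1]    # first (only) capture group
        len <- attr(mi, "capture.length")[, 1]
        substring(x[i], start, start + len - 1)
      })
    }

    res <- overlapping_matches(c("ACCACCACCAC", "GGG"), "(?=([AC]C+))")
    res[[1]]
    # [1] "ACC" "CC"  "ACC" "CC"  "ACC" "CC"  "AC"
    res[[2]]
    # character(0)
    ```

    The example pattern uses C+ so that the captured regions vary in length, which is the case the fixed-width substr() approach cannot handle.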

  • 2020-12-01 19:20

    Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length" with the "capture.length":

    x <- c("ACCACCACCAC","ACCACCACCAC")
    m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
    m <- lapply(m, function(i) {
           attr(i,"match.length") <- attr(i,"capture.length")
           i
         })
    regmatches(x,m)
    
    #[[1]]
    #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    #
    #[[2]]
    #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
    