I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.
I also found the following SO question speaking of findin
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by taking the substrings of two characters at a time and then removing the unwanted "CA" elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
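The same windowing idea extends to other widths and to pattern-based filtering. A minimal sketch, assuming a single input string; the helper name sliding_substrings and its width argument are made up here for illustration and are not part of base R:

```r
# Hypothetical helper: all overlapping substrings of `width` characters
# from a single string `x`.
sliding_substrings <- function(x, width = 2) {
  n <- nchar(x)
  if (n < width) return(character(0))   # string too short for any window
  starts <- seq_len(n - width + 1)
  substring(x, starts, starts + width - 1)
}

x <- 'ACCACCACCAC'
y <- sliding_substrings(x, 2)
# Keep only the windows that match the pattern in full
y[grepl('^[AC]C$', y)]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

Filtering with grepl() rather than an exact comparison keeps the approach usable when the wanted windows are described by a pattern rather than by a single unwanted value.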
The standard regmatches() does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a look-ahead (ignoring the capture), the match itself is zero-length. There is also a `regmatches<-` replacement function that may illustrate this. Observe:
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
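The zero-length matches can also be seen by inspecting the object that gregexpr() returns. A small sketch using only base R:

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)

# Start positions of the look-ahead matches
as.integer(m[[1]])
# [1]  1  2  4  5  7  8 10

# Every match has length zero, because a look-ahead consumes no characters
as.integer(attr(m[[1]], "match.length"))
# [1] 0 0 0 0 0 0 0

# so regmatches() has nothing to extract but empty strings
regmatches(x, m)[[1]]
# [1] "" "" "" "" "" "" ""
```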
I've created a regcapturedmatches() function that I often use for such tasks. For example:
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=TRUE))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
gregexpr() is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
A stringi solution using a capture group in the look-ahead part:
> library(stringi)
> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
As for a workaround, this is what I have come up with to extract the overlapping matches:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=TRUE)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Pretty ugly, which is why packages such as stringr exist.
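If pulling in a package is not an option, the attribute handling can be wrapped once in a small helper. A sketch, assuming a pattern with exactly one capture group; the name extract_first_capture is hypothetical and not part of base R:

```r
# Hypothetical helper: for each string in `x`, return the text of the first
# capture group at every match position (overlaps included when the group
# sits inside a look-ahead).
extract_first_capture <- function(x, pattern) {
  matches <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    m <- matches[[i]]
    if (m[1] == -1) return(character(0))      # no match in this string
    start <- attr(m, "capture.start")[, 1]    # first capture group only
    len   <- attr(m, "capture.length")[, 1]
    substring(x[i], start, start + len - 1)
  })
}

extract_first_capture(c('ACCACCACCAC', 'XYZ'), '(?=([AC]C))')
# Variable-length captures work too
extract_first_capture('AACAC', '(?=(A+C))')
```

Because substring() is vectorized over its start and end arguments, no explicit loop over the individual matches is needed; the lapply() only iterates over the input strings.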
Another roundabout way of extracting the same information, which I've used in the past, is to replace the "match.length" attribute with "capture.length":
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
  # pretend each match is as long as its captured group
  attr(i,"match.length") <- attr(i,"capture.length")
  i
})
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"