Question
I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.
I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language.
I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.
But when I do this in R the same way I would in other languages, using perl=T, no results are returned.
> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""
The same goes for both the stringi and stringr packages.
> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""
The correct results that should be returned when executing this are:
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Edit
I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are no results returned? I am looking for a somewhat detailed answer.
Are the stringi and stringr packages not capable of performing this over regmatches?
Please feel free to add to my answer or come up with a different workaround than the one I have found.
Answer 1:
The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a look-ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches<- replacement function that may illustrate this. Observe:
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've written a helper function, regcapturedmatches() (not part of base R), that I often use for such tasks. For example:
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
gregexpr is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
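To make that concrete, here is a minimal sketch that only inspects what gregexpr stored on its result when perl = TRUE and the pattern contains a capture group. The variable names (pos, mlen, cstart, clen) are mine and purely illustrative.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Start positions of the look-ahead matches (attributes stripped)
pos <- as.vector(m)

# Length of each overall match: all zero, which is why regmatches()
# hands back empty strings
mlen <- attr(m, "match.length")

# Start and length of the first capture group, one row per match
cstart <- attr(m, "capture.start")[, 1]
clen   <- attr(m, "capture.length")[, 1]

# The captured text can be cut out of x directly
captured <- substring(x, cstart, cstart + clen - 1)
```

The overall matches are empty, but the capture attributes carry everything needed to recover the overlapping pieces.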
Answer 2:
As for a workaround, this is what I came up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
Answer 3:
Another roundabout way of extracting the same information, which I've used in the past, is to replace the "match.length" attribute with the "capture.length":
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")
  i
})
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Answer 4:
It's not a regex solution, and it doesn't really answer any of your more important questions, but you could also get your desired result by taking the substrings two characters at a time and then removing the unwanted CA elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Answer 5:
A stringi solution using a capture group in the look-ahead part:
> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
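The question also asks about stringr. As a hedged sketch, and assuming stringr is installed and that its str_match_all() behaves like the stringi call above (stringr is built on top of stringi), the same idea would look like this. The variable name res is illustrative.

```r
library(stringr)

x <- 'ACCACCACCAC'

# str_match_all() returns one character matrix per input string:
# column 1 is the overall (here zero-length) match,
# column 2 is the first capture group
res <- str_match_all(x, '(?=([AC]C))')[[1]][, 2]
res
```

The point in both packages is to use a "match" function, which reports capture groups, rather than an "extract" function, which returns only the overall match.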
Answer 6:
An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Pretty ugly, which is why packages such as stringr exist.
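If you want to stay in base R, the steps above can be wrapped up once and reused. This is only a sketch: extract_overlapping() is a hypothetical name of my own, not an existing function, and it assumes the pattern's first capture group is the one you want.

```r
# Hypothetical helper (not part of base R or any package):
# returns, for each input string, the text of the first capture group
# for every match, including zero-length look-ahead matches.
extract_overlapping <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    mi <- m[[i]]
    # gregexpr() reports -1 when a string has no match at all
    if (mi[1] == -1) return(character(0))
    start <- unname(attr(mi, "capture.start")[, 1])
    len   <- unname(attr(mi, "capture.length")[, 1])
    substring(x[[i]], start, start + len - 1)
  })
}

res <- extract_overlapping(c('ACCACCACCAC', 'GGG'), '(?=([AC]C))')
res[[1]]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
res[[2]]
# character(0)
```

Because it reads capture.length rather than hard-coding a width, this also copes with variable-length captured regions.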
Source: https://stackoverflow.com/questions/25800042/overlapping-matches-in-r