Question
Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern, together with its capture group.
If there were a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how can I retrieve all substrings (or indices that can be passed to regmatches, as in the examples above) matching the capture groups of an arbitrary regular expression?
Note: the pattern r given above is just a silly example; the solution must work for an arbitrary pattern.
Answer 1:
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
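The one-liner above can be wrapped into a small helper. What follows is only a sketch: the name match_all_groups is hypothetical (it is not part of base R or stringr), it handles a single input string, and it assumes the pattern still matches when re-applied to each extracted substring on its own, which may not hold for patterns that rely on anchors or lookarounds.

```r
# Hypothetical helper (not part of base R or stringr).
# Step 1: gregexpr() finds every full match in the string.
# Step 2: regexec() is re-run on each full match to recover its capture groups.
match_all_groups <- function(pattern, x) {
  full  <- regmatches(x, gregexpr(pattern, x))[[1]]
  parts <- regmatches(full, regexec(pattern, full))
  # One row per match: column 1 is the full match, later columns are the groups.
  do.call(rbind, parts)
}

s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
m <- match_all_groups(r, s)
m[, 2]  # first capture group of every match
```

Each row of the result corresponds to one match, with the full match in the first column and the capture groups in the following columns, mirroring the layout returned by str_match_all.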
Answer 2:
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
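For readers on a recent R: base R gained a gregexec function in version 4.1.0, which extends regexec the way gregexpr extends regexpr. A minimal sketch, assuming R >= 4.1.0 (the version guard is only there so the snippet does not fail on older installations, where groups simply stays NULL):

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

groups <- NULL
if (getRversion() >= "4.1.0") {
  # regmatches() returns one matrix per input string: one column per match,
  # with the full match in row 1 and the capture groups in the rows below.
  m <- regmatches(s, gregexec(r, s))[[1]]
  groups <- m[2, ]  # row 2 holds the first capture group of every match
}
groups
```

Unlike the gsub() approach, this keeps all capture groups of a pattern available at once, one per matrix row.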
Answer 3:
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function that receives the captured portions of each match as input, and it returns the output of that function. When the function is c, it reduces to strapplyc. For example, suppose we wish to return the results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function takes the captured portions as input and returns the replacement, which is substituted for the match in the input string. If a formula is used, as in this example, the right-hand side of the formula is regarded as the function body. In this example we replace each match with XY{#}, where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Source: https://stackoverflow.com/questions/18620571/extract-capture-group-matches-from-regular-expressions-or-where-is-gregexec