Assume I have text strings that look something like this:
A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3
Here I want to identify sequ
Try the following expression:
(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*
See the match groups:
https://regex101.com/r/yA6aV9/1
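In R, the first capture group of that expression can be read back with gregexpr(..., perl = TRUE), which reports capture positions in the "capture.start" and "capture.length" attributes. The snippet below is only a sketch of one way to do that, not necessarily what this answer had in mind. The variable names and the final gsub() cleanup of leftover hyphens are assumptions added for illustration.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
pattern <- "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*"

# perl = TRUE is required to get the capture.* attributes on the result
m <- gregexpr(pattern, x, perl = TRUE)[[1]]
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Text captured by group 1 for every match
groups <- substring(x, starts, starts + lens - 1)

# Group 1 keeps the hyphens that sat next to each I3 run
# (an added cleanup step, not part of the original expression)
kept <- gsub("^-|-$", "", groups)
print(kept)
# kept holds "A-B-C-I1-I2-D-E-F" and "D-D-D-D"
```

Without the gsub() step the groups come back as "A-B-C-I1-I2-D-E-F-" and "-D-D-D-D-", because the expression does not consume the hyphens around each run.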
Use strsplit to split on every run of I tokens that contains I3:
> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C"
or
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
You can identify the sequences that contain I3 with the following regex:
(?:I\\d-?)*I3(?:-?I\\d)*
You can then split your text on this regex to get the desired result.
See demo https://regex101.com/r/bJ3iA3/4
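As a rough sketch of how the regex above could be used in R (assuming perl = TRUE, and with variable names that are purely illustrative), regmatches() lists the runs that contain I3 and strsplit() returns what is left between them. Note that this regex does not consume the hyphens on either side of a run, so the sketch trims them afterwards with gsub(); that cleanup is an added assumption, not part of the answer.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
pattern <- "(?:I\\d-?)*I3(?:-?I\\d)*"

# Runs of I tokens that contain I3
runs <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(runs)
# runs holds "I1-I3" and "I1-I1-I2-I1-I1-I3-I3"

# Split on those runs, then trim the hyphens left at the edges
pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
pieces <- gsub("^-|-$", "", pieces)
print(pieces)
# pieces holds "A-B-C-I1-I2-D-E-F" and "D-D-D-D"
```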