I have this mystring with the delimiter _. The condition here is: if there are two or more delimiters, I want to split at the second delimiter and if th
Perl/PCRE has the branch reset feature, (?|...), which lets you reuse a group number across capturing groups in different alternatives, so that they are treated as one capturing group.
IMO, this feature is elegant when you want to supply different alternatives.
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=TRUE)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
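To see what the branch reset buys here, the following sketch runs the same substitution without it. With a plain non-capturing group the two alternatives get separate group numbers, so the replacement has to concatenate \\1 and \\2, relying on the group that did not take part being empty. The names with_reset and without_reset are only for illustration.

```r
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
       'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
       'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')

# Branch reset (?|...): both alternatives capture into group 1.
with_reset <- sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl = TRUE)

# Non-capturing group (?:...): the alternatives capture into groups 1 and 2,
# so the replacement pastes both together (the unused one contributes nothing).
without_reset <- sub('^(?:([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1\\2', x, perl = TRUE)

# Compare the two results.
identical(with_reset, without_reset)
```

Both forms should give the same vector; the branch reset version just keeps the replacement string down to a single backreference.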
gregexpr can search for a pattern in strings and give the location.
First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of the second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).
After that, we can either extract the relevant part using substr based on the extracted index or, if there is an NA, we can split the string at .ReCal and keep only the first part.
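# Position of the second "_" in each element (NA if there is none)
inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
       substr(mystring, 1, inds - 1),
       # fixed = TRUE so that the "." in ".ReCal" is matched literally
       sapply(strsplit(mystring, ".ReCal", fixed = TRUE), '[', 1))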
#[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
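The intermediate values are easier to follow when printed. Below is a minimal sketch of the same steps; the question's data is not shown here, so mystring is assumed to hold four file names borrowed from another answer, and pos and result are illustrative names.

```r
# Assumed input (not taken from the question itself).
mystring <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
              'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam')

# One integer vector of "_" positions per element.
pos <- gregexpr("_", mystring, fixed = TRUE)

# Second position of each vector; indexing past the end yields NA.
inds <- sapply(pos, function(p) p[2])
print(inds)

# Cut before the second "_" where it exists, otherwise cut at ".ReCal".
result <- ifelse(!is.na(inds),
                 substr(mystring, 1, inds - 1),
                 sapply(strsplit(mystring, ".ReCal", fixed = TRUE), `[`, 1))
print(result)
```

With this input, the first and last elements contain only one _, so their entries in inds are NA and they fall through to the .ReCal split.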
gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
You can simply do this using gsub, without any complex regex. Just replace by \\1. See demo.
https://regex101.com/r/wL4aB6/1
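The answer gives only the pattern, so here is a sketch of how it might be applied in R. The input vector is an assumption (four file names borrowed from another answer), pattern and result are illustrative names, and perl = TRUE is used so that (?:...) and \n inside the character class are interpreted by PCRE.

```r
# Assumed input (not taken from the question itself).
mystring <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
              'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam')

# Group 1 is everything up to the second "_" or, failing that, up to ".ReCal".
pattern <- '^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$'

# Replace the whole match with group 1 only.
result <- sub(pattern, '\\1', mystring, perl = TRUE)
print(result)
```

Since the pattern is anchored at both ends, sub and gsub behave the same here: there is at most one match per string.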
With the stringr package:
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
It also works with more than two delimiters.
A little longer, but needs less regular expression knowledge:
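library(stringr)
# One matrix per element, with a row (start, end) for every "_"
indx <- str_locate_all(mystring, "_")
for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) < 2) {
    # Fewer than two delimiters: keep the part before ".ReCal"
    mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
  } else {
    # Two or more delimiters: keep the part before the second "_"
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2, "start"] - 1)
  }
}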