I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
My approach:
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub with this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
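One caveat with the sub approach, sketched below in base R: when the pattern does not match, sub returns the input unchanged rather than an empty string or NA. The extra input "no_code_here" and the helper name extract_code are assumptions made for illustration; they are not part of the original question.

```r
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_", "no_code_here")
pattern <- ".*([A-Z]\\d{4,5}[A-Z]?).*"

# sub() leaves non-matching strings untouched
plain <- sub(pattern, "\\1", vec)

# Hypothetical helper: return NA where the pattern is absent
extract_code <- function(x, pattern) {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

guarded <- extract_code(vec, pattern)
print(plain)
print(guarded)
```

Whether NA is the right signal for a missing code depends on what the downstream code expects; the guard only makes the no-match case visible.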
Using rex to construct the regular expression may make it more understandable.
library(rex)
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1: assumes the target always sits between the second and third underscores.
re_matches(x,
  rex(
    "_",
    anything,
    "_",
    capture(anything),
    "_"
  )
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2: assumes a letter, followed by 4 or 5 digits, with an optional trailing letter.
re_matches(x,
  rex(
    capture(
      alpha,
      between(digit, 4, 5),
      maybe(alpha)
    )
  )
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed = TRUE), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
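The index 3 works because each string starts with an underscore, so strsplit yields an empty first piece, while the trailing underscore adds no extra piece. A small sketch of this follows; it swaps in vapply as a type-checked alternative to sapply, which is a substitution and not the call used above.

```r
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

# Splitting a string that begins with "_" gives "" as the first piece
pieces <- strsplit(x, "_", fixed = TRUE)
print(pieces[[1]])

# vapply enforces that every result is a single character string
codes <- vapply(pieces, function(p) p[3], character(1))
print(codes)
```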
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl = TRUE))
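In that pattern, [^_]+ matches a run of non-underscore characters, and the lookahead (?=_$) requires the run to be followed by the final underscore without including it in the match. Note that regmatches combined with regexpr drops elements that do not match, so the result can be shorter than the input. A minimal sketch that keeps positions aligned is below; the extra input "no underscores" and the variable names are assumptions for illustration.

```r
x <- c("_A00_A1234B_", "no underscores", "_A1_A12345_")

m <- regexpr("[^_]+(?=_$)", x, perl = TRUE)

# regmatches() returns only the elements that matched
matched <- regmatches(x, m)

# Fill an NA vector so the output lines up with the input
aligned <- rep(NA_character_, length(x))
aligned[m != -1] <- regmatches(x, m)

print(matched)
print(aligned)
```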
You can try str_extract from stringr (str2 is defined at the end of this answer):
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by an optional capital letter [A-Z]?
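To make the three parts of the pattern explicit, here is a rough base R sketch that assembles it piece by piece. The use of paste0 and the variable names are only for illustration; the resulting pattern string is the same as the one above.

```r
str2 <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

# Assemble the pattern from its three parts
pattern <- paste0(
  "[A-Z]",       # one capital letter
  "[0-9]{4,5}",  # four or five digits
  "[A-Z]?"       # optional trailing capital letter
)

# "A00" and "A1" are skipped because they have fewer than four digits
codes <- regmatches(str2, regexpr(pattern, str2))
print(codes)
```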
Or you can use stringi, which would be faster:
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be:
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
Data used in the examples above:
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')