To extract the digits and - between the parentheses, one option is str_extract from stringr. If a string contains more than one match, use str_extract_all, which returns every match.
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
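For readers without stringr, here is a minimal base R sketch of the same idea. It is a swapped-in alternative, not the approach used above: regmatches with gregexpr, with perl = TRUE because the default regex engine does not support lookarounds. The names pat and all_matches are only illustrative.

```r
# Sample data (same as the "data" section of this answer)
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
          "(123-423-498) ABCDD",
          "(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")

# Same lookaround pattern as in the stringr calls
pat <- "(?<=\\()[0-9-]+(?=\\))"

# gregexpr finds every match per string; regmatches pulls them out as a list
all_matches <- regmatches(str2, gregexpr(pat, str2, perl = TRUE))

lengths(all_matches)
# [1] 1 1 1 1 2
all_matches[[5]]
# [1] "123-423-389" "124-233-848"
```

The result is a list with one character vector per input string, so the last element holds two matches.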
In the code above, we use regex lookarounds to extract the digits and - without including the parentheses themselves. The positive lookbehind in (?<=\\()[0-9-]+ matches digits and - ([0-9-]+) only when they are immediately preceded by (, so it matches in (123-456-789 but not in 123-456-789. Similarly, the positive lookahead in [0-9-]+(?=\\)) matches digits and - only when they are immediately followed by ), so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only the cases that satisfy both conditions, such as (123-456-789), and extracts the part between the lookarounds. It does not match cases like (123-456-789 or 123-456-789)
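The effect of each lookaround can be checked separately. The following is a small sketch in base R (grepl with perl = TRUE, which is needed for lookarounds). The test vector x and the variable names are made up for illustration.

```r
# Four cases: both parentheses, only "(", only ")", none
x <- c("(123-456-789)", "(123-456-789", "123-456-789)", "123-456-789")

pat_behind <- "(?<=\\()[0-9-]+"          # must be preceded by "("
pat_ahead  <- "[0-9-]+(?=\\))"           # must be followed by ")"
pat_both   <- "(?<=\\()[0-9-]+(?=\\))"   # both conditions

has_behind <- grepl(pat_behind, x, perl = TRUE)
has_ahead  <- grepl(pat_ahead,  x, perl = TRUE)
has_both   <- grepl(pat_both,   x, perl = TRUE)

has_behind
# [1]  TRUE  TRUE FALSE FALSE
has_ahead
# [1]  TRUE FALSE  TRUE FALSE
has_both
# [1]  TRUE FALSE FALSE FALSE
```

Only the first element, which has both parentheses, satisfies the combined pattern.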
With strsplit, you can specify the split pattern as [()]. Placing ( and ) inside the square brackets makes the regex treat them as literal characters. Otherwise we would have to escape the parentheses ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
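As a quick check, the character class and the escaped alternation should split the string identically. A minimal sketch, with split_class and split_escaped as illustrative names:

```r
str1 <- "A B C (123-456-789)"

# Parentheses as literals inside a character class
split_class <- strsplit(str1, "[()]")[[1]]

# Parentheses escaped and joined with alternation
split_escaped <- strsplit(str1, "\\(|\\)")[[1]]

split_class
# [1] "A B C "      "123-456-789"
identical(split_class, split_escaped)
# [1] TRUE
```

The second element is the part between the parentheses, which is why the call above indexes with [2].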
If there are multiple substrings to extract from a string, we can loop over the split results with lapply and keep only the pieces that contain digits with grep
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
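To see why the grep step is needed, it can help to look at the intermediate pieces. This sketch uses two of the sample strings. The names s, pieces and kept are illustrative only.

```r
s <- c("(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")

# Raw split: includes empty strings and the non-numeric text
pieces <- strsplit(s, "[()]")
pieces[[1]]
# [1] ""            "123-432-423"
pieces[[2]]
# [1] "ABC "        "123-423-389" " GR "        "124-233-848" " AK"

# Keep only the pieces that contain at least one digit
kept <- lapply(pieces, function(x) grep("\\d", x, value = TRUE))
kept[[1]]
# [1] "123-432-423"
kept[[2]]
# [1] "123-423-389" "124-233-848"
```

Filtering on '\\d' drops both the empty string produced by a leading ( and the letter-only fragments in one pass.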
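Or we can use stri_split_regex from stringi, which has an option to drop the empty strings as well (omit_empty=TRUE). Here the uppercase letters and the space are added to the split characters so that only the numbers remain.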
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex, which with extract=TRUE returns the contents inside the parentheses.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
data
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")