I am new to regex in R. I have a vector and I am interested in extracting the first occurrence of a number in each string of the vector.
I have a vector calle
You can do this very nicely with the str_first_number() function from the strex package, or, for more general needs, there's the str_nth_number() function.
pacman::p_load(strex)
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
"Pedro Vargas, 42, set fire to his apartment, killed six ...",
"John Zawahri, 23, armed with a homemade assault rifle ...",
"John Zawahri, 23, armed with a homemade assault rifle ...",
"Dennis Clark III, 27, shot and killed his girlfriend ...",
"Kurt Myers, 64, shot six people in neighboring ..."
)
str_first_number(shootsummary)
#> [1] 34 42 23 23 27 64
str_nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64
Created on 2018-09-03 by the reprex package (v0.2.0).
One option is str_extract from stringr, wrapped in as.numeric.
> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64
Update: In response to your question in the comments of this answer, here's a little explanation. The full explanation of a function can be found in its help file.
str_extract returns the first match of the regular expression in each string. It is vectorized over the character vector in its first argument.
[0-9]+ matches any character from '0' to '9', one or more times.
as.numeric changes the resulting character vector into a numeric vector.
R's regmatches() function, combined with regexpr(), returns a vector with the first regex match in each element:
regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE));
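One thing worth knowing about this approach: regmatches() only returns the elements that matched, so the result can be shorter than the input. The sketch below is a minimal base R illustration, not part of the original answer. The vector x and the names m and first_num are made up for the example, and it relies on regexpr() returning -1 where there is no match.

```r
# Hypothetical input; the second string contains no number.
x <- c("Aaron Alexis, 34, a military veteran",
       "no age given",
       "Kurt Myers, 64, shot six people")

# regexpr() gives the position of the first match, or -1 if there is none.
m <- regexpr("[0-9]+", x)

# regmatches() silently drops the non-matching element.
matched_only <- regmatches(x, m)

# Pre-fill with NA and assign only where a match exists,
# so the result stays aligned with the input.
first_num <- rep(NA_character_, length(x))
first_num[m != -1] <- matched_only
first_num <- as.numeric(first_num)
first_num
```

If every string is guaranteed to contain a number, as in the question's data, the one-line version is enough.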
How about splitting on commas and taking the second field:
splitbycomma <- strsplit(shootsummary, ",")
as.numeric( sapply(splitbycomma, "[", 2) )
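Note that this relies on the number always being the second comma-separated field. The following sketch, using a made-up vector x whose second element has an extra leading field, compares it with a pattern-based extraction. It is only an illustration of the assumption, not a claim about the actual data.

```r
# Hypothetical input; the second string has an extra field before the age.
x <- c("Aaron Alexis, 34, a military veteran",
       "Navy Yard, Washington: Aaron Alexis, 34, opened fire")

# Positional approach: take the second comma-separated field.
fields <- strsplit(x, ",")
by_position <- suppressWarnings(as.numeric(sapply(fields, "[", 2)))

# Pattern approach: take the first run of digits, wherever it is.
by_pattern <- as.numeric(regmatches(x, regexpr("[0-9]+", x)))

by_position  # NA where the second field is not a number
by_pattern
```

as.numeric() tolerates the leading space left by the split, so no trimming is needed when the field really is a number.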
You could try the sub command below:
> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"
Pattern Explanation:
^ asserts that we are at the start of a line.
\D* matches zero or more non-digit characters.
(\d+) then captures the following one or more digits into group 1 (the first number).
.* matches any character zero or more times.
$ asserts that we are at the end of a line.
stringi would be faster
library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"
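A caveat for the sub() approach shown earlier, in case some strings contain no number at all: sub() returns its input unchanged when the pattern does not match, whereas the extract-style functions return NA. A minimal base R sketch, with a made-up vector x and helper names out and has_digit:

```r
# Hypothetical input; the second string contains no digits.
x <- c("Aaron Alexis, 34, a military veteran", "no digits here")

# Same pattern as in the sub() answer above.
out <- sub("^\\D*(\\d+).*$", "\\1", x)
out  # the second element comes back unchanged

# Blank out the elements that never contained a digit before converting.
has_digit <- grepl("\\d", x)
out[!has_digit] <- NA
ages <- as.numeric(out)
ages
```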