How to extract the first number from each string in a vector in R?

醉话见心 2020-12-18 10:53

I am new to regex in R. I have a vector, and I want to extract the first occurrence of a number in each of its strings.

I have a vector calle

  • 2020-12-18 11:06

    You can do this very nicely with the str_first_number() function from the strex package, or for more general needs, there's the str_nth_number() function.

    library(strex)

    shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                      "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                      "John Zawahri, 23, armed with a homemade assault rifle ...",
                      "John Zawahri, 23, armed with a homemade assault rifle ...",
                      "Dennis Clark III, 27, shot and killed his girlfriend ...",
                      "Kurt Myers, 64, shot six people in neighboring ...")
    str_first_number(shootsummary)
    #> [1] 34 42 23 23 27 64
    str_nth_number(shootsummary, n = 1)
    #> [1] 34 42 23 23 27 64

    Created on 2018-09-03 by the reprex package (v0.2.0).

  • 2020-12-18 11:10

    One option is str_extract from stringr with an as.numeric wrap.

    > library(stringr)
    > as.numeric(str_extract(shootsummary, "[0-9]+"))
    # [1] 34 42 23 27 64

    Update: In response to your question in the comments on this answer, here's a brief explanation. The full documentation for a function can be found in its help file.

    • str_extract returns the first occurrence of the regular expression. It is vectorized over the character vector in its first argument.
    • The regular expression [0-9]+ matches one or more consecutive characters in the range '0' to '9'.
    • as.numeric changes the resulting character vector into a numeric vector.
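
    The three steps above can be seen separately in a small sketch. The two-element vector x is made up for illustration, and the second string is an assumed edge case (no digits at all) that is not in the original data:

    ```r
    library(stringr)

    # Hypothetical input: one string with two numbers, one with none
    x <- c("Aaron Alexis, 34, killing 12 people", "no digits here")

    # Step 1: first run of digits in each element, still as character
    matched <- str_extract(x, "[0-9]+")
    print(matched)

    # Step 2: convert the character vector to numeric; NA stays NA
    ages <- as.numeric(matched)
    print(ages)
    ```

    Only the first run of digits is returned for the first string (the later 12 is ignored), and the string without digits yields NA, so the result keeps the same length as the input.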
  • 2020-12-18 11:10

    R's regmatches() function, combined with regexpr(), returns a character vector containing the first regex match from each element:

    regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE))
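
    One thing to watch with this approach: elements that contain no match are dropped from the result, so the output can be shorter than the input. A minimal sketch of that behaviour and one possible way to keep the positions aligned; the vector x and the name first_num are made up for illustration:

    ```r
    # Hypothetical input with one element that has no digits
    x <- c("Aaron Alexis, 34, a veteran", "no digits here", "Kurt Myers, 64, shot six")

    # regexpr() gives the position of the first match, or -1 when there is none
    m <- regexpr("[0-9]+", x)

    # regmatches() silently skips the non-matching element
    found <- regmatches(x, m)
    print(found)

    # Keep alignment: start from NA and fill in only where a match exists
    first_num <- rep(NA_character_, length(x))
    first_num[m != -1] <- found
    ages <- as.numeric(first_num)
    print(ages)
    ```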
  • 2020-12-18 11:13

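    How about splitting each string on commas and taking the second field? This assumes the number is always the second comma-separated field.

    splitbycomma <- strsplit(shootsummary, ",")
    as.numeric(sapply(splitbycomma, "[", 2))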
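
    A short sketch of what the intermediate values look like, and what happens when a string does not follow the "name, age, ..." layout. The last two strings are invented counterexamples, not part of the original data:

    ```r
    # First string follows the expected layout; the other two are hypothetical
    x <- c("Aaron Alexis, 34, a military veteran and contractor ...",
           "An entry with no commas at all",
           "Somewhere, Texas, 13 people")

    fields <- strsplit(x, ",")

    # Second field of each element; NA when there is no second field
    second <- sapply(fields, "[", 2)
    print(second)

    # Leading whitespace is tolerated by as.numeric();
    # a non-numeric field becomes NA (with a warning, suppressed here)
    ages <- suppressWarnings(as.numeric(second))
    print(ages)
    ```

    So the approach is compact, but it returns NA whenever the first number is not exactly the second field.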
  • 2020-12-18 11:17

    You could try the sub() call below:

    > test
    [1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
    [2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
    > sub("^\\D*(\\d+).*$", "\\1", test)
    [1] "34" "42"

    Pattern Explanation:

    • ^ asserts that we are at the start of the string.
    • \D* matches zero or more non-digit characters.
    • (\d+) captures the following one or more digits into group 1 (the first number).
    • .* matches any character zero or more times.
    • $ asserts that we are at the end of the string.
    • Finally, all the matched characters are replaced by the contents of the first group.
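
    Because sub() only replaces when the whole pattern matches, a string without any digits comes back unchanged rather than as NA. A minimal sketch of that case and one possible guard; the vector x and the names has_num and out are hypothetical:

    ```r
    # Hypothetical input: the second string contains no digits
    x <- c("Aaron Alexis, 34, ... killing 12 people", "no digits here")

    res <- sub("^\\D*(\\d+).*$", "\\1", x)
    print(res)  # the second element is returned as-is, not as NA

    # Guard: keep the result only where the input contains a digit
    has_num <- grepl("\\d", x)
    out <- ifelse(has_num, res, NA_character_)
    ages <- as.numeric(out)
    print(ages)
    ```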
  • 2020-12-18 11:22

    stringi would be faster:

    library(stringi)
    stri_extract_first(shootsummary, regex="\\d+")
    #[1] "34" "42" "23" "27" "64"