str_extract specific patterns (example)

轻奢々 2021-01-06 12:04

I'm still a little confused by regex syntax. Can you please help me extract the second code from strings like these:

_A00_A1234B_
_A00_A12345B_
_A1_A12345_

my approac

4 Answers
  • 2021-01-06 12:43
    vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
    

    You can use sub and this regex:

    sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
    # [1] "A1234B"  "A12345B" "A12345" 
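
    One thing to keep in mind with this approach: when the pattern does not match, sub returns the input unchanged rather than NA. A minimal sketch of one way to guard against that, assuming non-matches should become NA. The extra "_no_match_" element and the names vec2, pat and out are hypothetical, added only for illustration:

    ```r
    vec2 <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_", "_no_match_")
    pat <- ".*([A-Z]\\d{4,5}[A-Z]?).*"

    # sub() leaves non-matching strings untouched, so mask them explicitly:
    # keep the captured group where the pattern matches, NA elsewhere
    out <- ifelse(grepl(pat, vec2), sub(pat, "\\1", vec2), NA_character_)
    out
    # the first three elements are the extracted codes, the fourth is NA
    ```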
    
  • 2021-01-06 12:45

    Using rex to construct the regular expression may make it more understandable.

    library(rex)

    x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
    
    # approach #1, assumes the value always sits between the second and third underscores.
    re_matches(x,
      rex(
        "_",
        anything,
        "_",
        capture(anything),
        "_"
      )
    )
    
    #>         1
    #> 1  A1234B
    #> 2 A12345B
    #> 3  A12345
    
    
    # approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
    re_matches(x,
      rex(
        capture(
          alpha,
          between(digit, 4, 5),
          maybe(alpha)
        )
      )
    )
    
    #>         1
    #> 1  A1234B
    #> 2 A12345B
    #> 3  A12345
    
  • 2021-01-06 12:58

    You can do this without using a regular expression ...

    x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
    sapply(strsplit(x, '_', fixed = TRUE), '[', 3)
    # [1] "A1234B"  "A12345B" "A12345" 
    

    If you insist on using a regular expression, the following will suffice.

    regmatches(x, regexpr('[^_]+(?=_$)', x, perl = TRUE))
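
    The lookahead (?=_$) asserts that the match is followed by the final underscore without including that underscore in the result. A short sketch of how this behaves on the sample vector, plus one caveat: regmatches drops elements that have no match, so the result can be shorter than the input. The 'no_trailing' string and the names m, y and n are hypothetical, added for illustration:

    ```r
    x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')

    # [^_]+  : one or more characters that are not underscores
    # (?=_$) : must be followed by an underscore at the end of the string
    m <- regmatches(x, regexpr('[^_]+(?=_$)', x, perl = TRUE))
    m

    # an element without a trailing underscore has no match and is dropped
    y <- c(x, 'no_trailing')
    n <- length(regmatches(y, regexpr('[^_]+(?=_$)', y, perl = TRUE)))
    n
    ```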
    
  • 2021-01-06 13:05

    You can try

    library(stringr)
    str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
    #[1] "A1234B"  "A12345B" "A12345" 
    

    Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by an optional capital letter [A-Z]?
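
    To see how each part of the pattern contributes, the sketch below checks it against a few hypothetical strings (not from the question) using base R's grepl. The pattern is anchored with ^ and $ here so that the whole string has to match, which is an addition to the original pattern:

    ```r
    pat <- "^[A-Z][0-9]{4,5}[A-Z]?$"
    tests <- c("A1234", "A12345B", "A123", "A123456", "A1234BC")

    # A1234   : letter + 4 digits, no trailing letter  -> matches
    # A12345B : letter + 5 digits + one letter         -> matches
    # A123    : only 3 digits                          -> no match
    # A123456 : 6 digits                               -> no match
    # A1234BC : two trailing letters                   -> no match
    ok <- grepl(pat, tests)
    setNames(ok, tests)
    ```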

    Or you can use stringi, which may be faster:

    library(stringi)
    stri_extract(str2, regex = "[A-Z][0-9]{4,5}[A-Z]?")
    #[1] "A1234B"  "A12345B" "A12345" 
    

    Or a base R option would be:

    regmatches(str2, regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
    #[1] "A1234B"  "A12345B" "A12345" 
    

    Data:

    str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
    