Capturing parts of string using regular expression in R

盖世英雄少女心 2021-01-21 19:01

I have these strings:

myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")

What I want to do is to capture parts of

3 answers
  • 2021-01-21 19:24

    You are pretty close. Here is a small adjustment:

    library(stringr)  # provides str_match()
    str_match(myseq, "(.+)_(LN|LV|SP)_06\\.([A-Z]+)")[, -1]
    

    produces:

         [,1]      [,2] [,3]
    [1,] "ALM_GSK" "LN" "ID"
    [2,] "AS04"    "LV" "ID"
    [3,] "AS04"    "SP" "IP"
    

    Yours doesn't work because your first token matches neither digits nor underscores, both of which you need: "AS04" contains digits and "ALM_GSK" contains an underscore.
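
    To see the difference without any packages, here is a minimal base R sketch. The question's original pattern is not shown above, so `letters_only` is a hypothetical stand-in with a letters-only first group; `any_chars` is this answer's pattern. Both are anchored with `^`, which is an addition made here so the comparison is about the whole prefix.

    ```r
    myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")

    # Hypothetical stand-in for the question's pattern (assumption):
    # the first group accepts uppercase letters only.
    letters_only <- "^([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)"

    # This answer's pattern, anchored at the start:
    # the first group accepts any characters, including digits and underscores.
    any_chars <- "^(.+)_(LN|LV|SP)_06\\.([A-Z]+)"

    grepl(letters_only, myseq)  # FALSE FALSE FALSE
    grepl(any_chars, myseq)     # TRUE TRUE TRUE
    ```

    The letters-only group stops at the "0" in "AS04" and at the first underscore in "ALM_GSK", so the rest of the pattern can no longer line up.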

  • 2021-01-21 19:30

    Your regular expression fails to match the prefix because [A-Z]+ only matches uppercase letters, while the prefixes also contain digits and underscores. To fix this, change the first group to a greedy match on any character, such as (.+). Here is another solution:

    library(gsubfn)
    myseq <- c('ALM_GSK_LN_06.ID', 'AS04_LV_06.ID.png', 'AS04_SP_06.IP.png')
    strapply(myseq, '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)', c, simplify = rbind)
    
    #      [,1]      [,2] [,3]
    # [1,] "ALM_GSK" "LN" "ID"
    # [2,] "AS04"    "LV" "ID"
    # [3,] "AS04"    "SP" "IP"
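
    If gsubfn is not available, the same pattern can probably be applied with base R alone. This is a sketch rather than part of the original answer; `pattern`, `m`, and `parts` are names chosen here for illustration, and `perl = TRUE` is an assumption made to get ordinary backtracking (greedy) semantics.

    ```r
    myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")
    pattern <- "(.+)_([A-Z]+)[^.]+\\.([A-Z]+)"

    # regexec() locates the full match and each capture group;
    # regmatches() extracts them as one character vector per input string.
    m <- regmatches(myseq, regexec(pattern, myseq, perl = TRUE))

    # Drop the first element (the full match) and bind the groups into
    # a matrix with one row per input string.
    parts <- t(sapply(m, function(x) x[-1]))
    parts
    ```

    The rows of `parts` should hold the same three captures as the strapply() result above.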
    
  • 2021-01-21 19:32

    Totally stealing @hwnd's regex but in a tidyr/dplyr approach:

    library(dplyr); library(tidyr)
    # tibble() replaces the deprecated data_frame()
    tibble(myseq) %>%
        extract(myseq, c('A', 'B', 'C'), '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)')
    
    ##         A  B  C
    ## 1 ALM_GSK LN ID
    ## 2    AS04 LV ID
    ## 3    AS04 SP IP
    