I have these strings:
myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")
What I want to do is to capture parts of
You are pretty close. Here is a small adjustment:
library(stringr)  # provides str_match()
str_match(myseq, "(.+)_(LN|LV|SP)_06\\.([A-Z]+)")[, -1]
produces:
[,1] [,2] [,3]
[1,] "ALM_GSK" "LN" "ID"
[2,] "AS04" "LV" "ID"
[3,] "AS04" "SP" "IP"
Yours doesn't work because your first token matches neither numbers nor underscores, which you need for "AS04" (numbers) and "ALM_GSK" (underscores).
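To see the difference concretely, here is a minimal base R sketch. The `letters_only` pattern is a hypothetical stand-in for a letters-only first group (the original pattern is not shown here), and the `^` anchors are added only to make the comparison unambiguous:

```r
myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")

# Hypothetical pattern: the first group accepts uppercase letters only,
# so it cannot span the digits in "AS04" or the underscore in "ALM_GSK".
letters_only <- "^([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)"

# Adjusted pattern: the first group accepts any characters and backtracks
# until the rest of the pattern fits.
greedy <- "^(.+)_(LN|LV|SP)_06\\.([A-Z]+)"

matches_letters_only <- grepl(letters_only, myseq)  # FALSE FALSE FALSE
matches_greedy       <- grepl(greedy, myseq)        # TRUE TRUE TRUE

# Base R equivalent of str_match(...)[, -1]: drop the full-match column
# and keep the three captured groups.
parts <- do.call(rbind, regmatches(myseq, regexec(greedy, myseq)))[, -1]
```

`parts` should then be a 3 x 3 character matrix holding the same groups as the `str_match()` result above.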
Your regular expression fails to match the prefix because [A-Z]+ only matches uppercase letters. To fix this, simply change the first group to a greedy pattern such as (.+). Here is another solution:
library(gsubfn)
myseq <- c('ALM_GSK_LN_06.ID', 'AS04_LV_06.ID.png', 'AS04_SP_06.IP.png')
strapply(myseq, '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)', c, simplify = rbind)
# [,1] [,2] [,3]
# [1,] "ALM_GSK" "LN" "ID"
# [2,] "AS04" "LV" "ID"
# [3,] "AS04" "SP" "IP"
Totally stealing @hwnd's regex but in a tidyr/dplyr approach:
library(dplyr); library(tidyr)
data_frame(myseq) %>%
  extract(myseq, c('A', 'B', 'C'), '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)')
## A B C
## 1 ALM_GSK LN ID
## 2 AS04 LV ID
## 3 AS04 SP IP