I\'m trying to select rows in a dataframe where the string contained in a column matches either a regular expression or a substring:
dataframe:
I tested using Expresso and used .Net-style regexes; you may have to tweak for your regex flavor. I also left whitespace in for readability; remove or use a regex option flag to ignore.
The basic regex to capture all lines is:
(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]+ ) \s+ (?<pName> [\w-_]+ ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ )
From this, you just need to tweak the regex for the appropriate named capture group(s) to extract only the lines that you want. The modified version to capture using the criteria you gave (bName contains "ADN" and pName = "2011-02-10_R2") is:
(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]*ADN[\w_]* ) \s+ (?<pName> 2011-02-10_R2 ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ )
Here you go.
First recreate your data:
dat <- read.table(text="
aName bName pName call alleles logRatio strength
AX-11086564 F08_ADN103 2011-02-10_R10 AB CG 0.363371 10.184215
AX-11086564 A01_CD1919 2011-02-24_R11 BB GG -1.352707 9.54909
AX-11086564 B05_CD2920 2011-01-27_R6 AB CG -0.183802 9.766334
AX-11086564 D04_CD5950 2011-02-09_R9 AB CG 0.162586 10.165051
AX-11086564 D07_CD6025 2011-02-10_R10 AB CG -0.397097 9.940238
AX-11086564 B05_CD3630 2011-02-02_R7 AA CC 2.349906 9.153076
AX-11086564 D04_ADN103 2011-02-10_R2 BB GG -1.898088 9.872966
AX-11086564 A01_CD2588 2011-01-27_R5 BB GG -1.208094 9.239801
", header=TRUE)
Next, use grepl
to construct a logical index of matches:
index1 <- with(dat, grepl("ADN", bName))
index2 <- with(dat, grepl("2011-02-10_R2", pName))
Now subset using the &
operator:
dat[index1 & index2, ]
aName bName pName call alleles logRatio strength
7 AX-11086564 D04_ADN103 2011-02-10_R2 BB GG -1.898088 9.872966
subset(dat, grepl("ADN", bName) & pName == "2011-02-10_R2" )
Note "&" (and not "&&" which is not vectorized) and that "==" (and not"=" which is assignment).
Note that you could have used:
dat[ with(dat, grepl("ADN", bName) & pName == "2011-02-10_R2" ) , ]
... and that might be preferable when used inside functions, however, that will return NA values for any lines where dat$pName is NA. That defect (which some regard as a feature) could be removed by the addition of & !is.na(dat$pName)
to the logical expression.
Why not just:
grep 'ADN'|grep '2011-02-10_R2'
You could also do this:
grep -P '\t.{4}(ADN).*(2011-02-10_R2).*'
The same logic as above
df %>%
filter(grepl("ADN", bName) & grepl("2011-02-10_R2", pName))
# aName bName pName call alleles logRatio strength
# 1 AX-11086564 D04_ADN103 2011-02-10_R2 BB GG -1.898088 9.872966
This is a pretty minimal solution using dplyr and magrittr which I think is what you are after:
Data:
library(magrittr)
library(stringr)
dat <- read.table(text="
aName bName pName call alleles logRatio strength
AX-11086564 F08_ADN103 2011-02-10_R10 AB CG 0.363371 10.184215
AX-11086564 A01_CD1919 2011-02-24_R11 BB GG -1.352707 9.54909
AX-11086564 B05_CD2920 2011-01-27_R6 AB CG -0.183802 9.766334
AX-11086564 D04_CD5950 2011-02-09_R9 AB CG 0.162586 10.165051
AX-11086564 D07_CD6025 2011-02-10_R10 AB CG -0.397097 9.940238
AX-11086564 B05_CD3630 2011-02-02_R7 AA CC 2.349906 9.153076
AX-11086564 D04_ADN103 2011-02-10_R2 BB GG -1.898088 9.872966
AX-11086564 A01_CD2588 2011-01-27_R5 BB GG -1.208094 9.239801
", header=TRUE)
rows that contain ADN in column bName.
dat %>%
filter(str_detect(bName, "ADN") == TRUE)
Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.
dat %>%
filter(str_detect(bName, "ADN") & pName == "2011-02-10_R2")