Using regexp to select rows in R dataframe

终归单人心 2020-12-08 00:48

I'm trying to select rows in a dataframe where the string contained in a column matches either a regular expression or a substring:

dataframe:

7 answers
  • 2020-12-08 01:31

    I tested using Expresso and used .NET-style regexes; you may have to tweak them for your regex flavor. I also left whitespace in for readability; remove it or use a regex option flag to ignore it.

    The basic regex to capture all lines is:

    (?<aName> [\w-]+ ) \s+ (?<bName> [\w_]+ ) \s+ (?<pName> [\w-_]+ ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ ) 
    

    From this, you just need to tweak the regex for the appropriate named capture group(s) to extract only the lines that you want. The modified version to capture using the criteria you gave (bName contains "ADN" and pName = "2011-02-10_R2") is:

    (?<aName> [\w-]+ ) \s+ (?<bName> [\w_]*ADN[\w_]* ) \s+ (?<pName> 2011-02-10_R2 ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ ) 
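    For readers working in R rather than .NET, here is a rough sketch of the same whole-line idea. It assumes the rows are available as plain text lines; the `lines` vector is a hypothetical three-row sample taken from the data in this thread, and the pattern is a simplified one (no named groups), not a port of the regex above.

    ```r
    # Hypothetical sample: three raw text lines from the data in this thread
    lines <- c(
      "AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215",
      "AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966",
      "AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801"
    )

    # First field: any non-space run; second field: contains ADN;
    # third field: exactly 2011-02-10_R2 (the trailing \\s+ excludes R20, R21, ...)
    pattern <- "^\\S+\\s+\\w*ADN\\w*\\s+2011-02-10_R2\\s+"

    hits <- lines[grepl(pattern, lines, perl = TRUE)]
    hits
    ```

    Only the D04_ADN103 line should be kept; the F08_ADN103 line fails because its third field is 2011-02-10_R10.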
    
  • 2020-12-08 01:41

    Here you go.

    First recreate your data:

    dat <- read.table(text="
    aName   bName   pName   call  alleles   logRatio    strength
    AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
    AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
    AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
    AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
    AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
    AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
    AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
    AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
    ", header=TRUE)
    

    Next, use grepl to construct a logical index of matches:

    index1 <- with(dat, grepl("ADN", bName))
    index2 <- with(dat, grepl("2011-02-10_R2", pName))
    

    Now subset using the & operator:

    dat[index1 & index2, ]
            aName      bName         pName call alleles  logRatio strength
    7 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966
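
    One caveat worth sketching: grepl() treats its pattern as a regular expression and matches anywhere in the string, so "2011-02-10_R2" would also match a value such as 2011-02-10_R20. The `pName` vector below is a hypothetical sample (the R20 value does not occur in the data above); anchoring the pattern or comparing with == are two possible ways to require an exact match.

    ```r
    # Hypothetical values; "2011-02-10_R20" is invented for illustration
    pName <- c("2011-02-10_R2", "2011-02-10_R20", "2011-02-10_R10")

    substring_match <- grepl("2011-02-10_R2", pName)    # TRUE  TRUE FALSE
    anchored_match  <- grepl("^2011-02-10_R2$", pName)  # TRUE FALSE FALSE
    exact_match     <- pName == "2011-02-10_R2"         # TRUE FALSE FALSE

    substring_match
    anchored_match
    exact_match
    ```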
    
  • 2020-12-08 01:45
    subset(dat, grepl("ADN", bName)  &  pName == "2011-02-10_R2" )
    

    Note the use of "&" (not "&&", which is not vectorized) and of "==" (not "=", which is assignment).

    Note that you could have used:

     dat[ with(dat,  grepl("ADN", bName)  &  pName == "2011-02-10_R2" ) , ]
    

    ... and that might be preferable when used inside functions. However, it will return a row of NA values for any row where the logical expression evaluates to NA, for example where bName matches and dat$pName is NA. That defect (which some regard as a feature) can be removed by adding & !is.na(dat$pName) to the logical expression.
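
    A small sketch of that NA behaviour, using a hypothetical three-row data frame (`dat_na`, invented for illustration) in which one pName is missing. It compares plain logical indexing with two possible workarounds: the !is.na() guard and which().

    ```r
    # Hypothetical data: the second row matches "ADN" but has a missing pName
    dat_na <- data.frame(
      bName = c("D04_ADN103", "F08_ADN103", "A01_CD1919"),
      pName = c("2011-02-10_R2", NA, "2011-02-24_R11"),
      stringsAsFactors = FALSE
    )

    keep <- with(dat_na, grepl("ADN", bName) & pName == "2011-02-10_R2")
    keep                                      # TRUE NA FALSE

    with_na   <- dat_na[keep, ]                         # 2 rows: the match plus a row of NAs
    guarded   <- dat_na[keep & !is.na(dat_na$pName), ]  # 1 row
    via_which <- dat_na[which(keep), ]                  # 1 row; which() drops NA

    nrow(with_na)
    nrow(guarded)
    nrow(via_which)
    ```

    subset() treats missing values in the condition as FALSE, which is why the first form in this answer does not show the extra row.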

  • 2020-12-08 01:45

    Why not just use grep from the shell:

    grep 'ADN'|grep '2011-02-10_R2'
    

    You could also do this:

    grep -P '\t.{4}(ADN).*(2011-02-10_R2).*'
    
  • 2020-12-08 01:48

    The same logic as above, using dplyr (dat is the data frame created in the earlier answer):

    library(dplyr)

    dat %>% 
      filter(grepl("ADN", bName) & grepl("2011-02-10_R2", pName))
    #     aName      bName         pName call alleles  logRatio     strength
    # 1 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966
    
  • 2020-12-08 01:51

    This is a pretty minimal solution using dplyr, magrittr, and stringr, which I think is what you are after:

    Data:
    library(dplyr)     # filter()
    library(magrittr)  # %>%
    library(stringr)   # str_detect()
    dat <- read.table(text="
    aName   bName   pName   call  alleles   logRatio    strength
    AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
    AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
    AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
    AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
    AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
    AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
    AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
    AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
    ", header=TRUE)
    

    First, all rows that contain ADN in column bName:

    dat %>%
      filter(str_detect(bName, "ADN"))
    

    Second, all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName:

    dat %>%
      filter(str_detect(bName, "ADN") & pName == "2011-02-10_R2") 
    