How to subset data with advanced string matching

误落风尘 2021-02-03 10:33

I have the following data frame from which I would like to extract rows based on matching strings.

> GEMA_EO5
gene_symbol  fold_EO  p_value                            


        
2 Answers
  • 2021-02-03 11:25

    To do partial matching you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:

    ## Note that the first element appears in
    ## a row containing commas
    l <- c("NM_013433", "NM_001386", "NM_020385")
    

    To test one sequence at a time, select a single sequence ID from the vector:

    R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
      gene_symbol fold_EO p_value                           RefSeq_ID BH_p_value
    5       TNPO2   4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433  1.538e-20
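
    Note that grepl() does substring matching here, so an ID also matches any longer ID that contains it. A minimal sketch of that behaviour, assuming a made-up data frame called toy (the gene names and most IDs are hypothetical, not the real GEMA_EO5):

    ```r
    # Toy stand-in for GEMA_EO5; all values are invented for illustration.
    toy <- data.frame(
      gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
      RefSeq_ID   = c("NM_000001,NM_013433", "NM_000002", "NM_0134330"),
      stringsAsFactors = FALSE
    )

    # fixed = TRUE treats the ID as a literal substring rather than a regex.
    hit <- grepl("NM_013433", toy$RefSeq_ID, fixed = TRUE)

    # Row 1 matches inside the comma-separated list; row 3 also matches,
    # because "NM_0134330" contains "NM_013433" as a substring.
    print(toy[hit, ])
    ```

    If longer IDs like the one in row 3 can occur in your data, the pattern needs to be anchored on word boundaries.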
    

    To test for several IDs at once, join them with the regular-expression alternation operator |:

    R> paste(l, collapse="|")
    [1] "NM_013433|NM_001386|NM_020385"
    R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
    [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
    

    So

    subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
    

    should give you what you want.
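
    If some IDs in the column could be longer versions of the ones you search for, a stricter variant wraps the alternation in word boundaries. This is only a sketch on a made-up data frame (toy and ids are hypothetical names, and the values are invented); perl = TRUE is used so that \b is handled by the PCRE engine:

    ```r
    # Toy stand-in for GEMA_EO5; all values are invented for illustration.
    toy <- data.frame(
      gene_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
      RefSeq_ID   = c("NM_000001,NM_013433", "NM_001386",
                      "NM_0203850", "NM_000002"),
      stringsAsFactors = FALSE
    )
    ids <- c("NM_013433", "NM_001386", "NM_020385")

    # Plain alternation: matches any row containing one of the IDs as a substring.
    loose <- grepl(paste(ids, collapse = "|"), toy$RefSeq_ID)

    # Alternation wrapped in \b...\b: the ID must be a whole "word",
    # so "NM_0203850" no longer counts as a hit for "NM_020385".
    pattern <- paste0("\\b(", paste(ids, collapse = "|"), ")\\b")
    exact <- grepl(pattern, toy$RefSeq_ID, perl = TRUE)

    print(toy[exact, ])
    ```

    On this toy input, loose flags GENE_C as well, while exact keeps only GENE_A and GENE_B.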

  • 2021-02-03 11:38

    A different approach is to recognize the multiple comma-separated entries in RefSeq_ID as an attempt to represent two database tables in a single data frame. So if the original table is a data frame named csv, normalize the data into two tables

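    # Annotation table: one row per gene, keyed by row number
    Anno <- cbind(key = seq_len(nrow(csv)), csv[, names(csv) != "RefSeq_ID"])
    # as.character() guards against RefSeq_ID having been read in as a factor
    key0 <- strsplit(as.character(csv$RefSeq_ID), ",")
    # RefSeq table: one row per (gene key, RefSeq ID) pair
    RefSeq <- data.frame(key = rep(seq_along(key0), lengths(key0)),
                         ID = unlist(key0))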
    

    and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno

    l <- c( "NM_013433", "NM_001386", "NM_020385")
    merge(Anno, subset(RefSeq, ID %in% l))[, -1]
    

    leading to

    > merge(Anno, subset(RefSeq, ID %in% l))[, -1]
      gene_symbol  fold_EO  p_value   BH_p_value        ID
    1       REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
    2       TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
    3      DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
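
    The normalize, select and join steps can be tried end to end on a small made-up input. In this sketch the name csv follows the text above, but the gene names, fold changes and most IDs are hypothetical, and ids, hits and result are illustrative names:

    ```r
    # Made-up input; values are invented for illustration only.
    csv <- data.frame(
      gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
      fold_EO     = c(4.7, 3.2, 5.1),
      RefSeq_ID   = c("NM_000001,NM_013433", "NM_020385", "NM_000002"),
      stringsAsFactors = FALSE
    )

    # Normalize: one annotation row per gene, one RefSeq row per (gene, ID) pair.
    Anno   <- cbind(key = seq_len(nrow(csv)), csv[, names(csv) != "RefSeq_ID"])
    key0   <- strsplit(csv$RefSeq_ID, ",", fixed = TRUE)
    RefSeq <- data.frame(key = rep(seq_along(key0), lengths(key0)),
                         ID  = unlist(key0),
                         stringsAsFactors = FALSE)

    # Select the wanted IDs, then join back to the annotation on key.
    ids    <- c("NM_013433", "NM_020385")
    hits   <- subset(RefSeq, ID %in% ids)
    result <- merge(Anno, hits, by = "key")[, -1]  # drop the key column
    print(result)
    ```

    Because %in% compares whole IDs, this route does exact matching without any regular expression.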
    

    If the goal is instead to merge with a 'Master' table that keeps every original column, including RefSeq_ID, then

    Master <- cbind(key = seq_len(nrow(csv)), csv)
    merge(Master, subset(RefSeq, ID %in% l))[,-1]
    

    or similar.
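
    One thing to watch with the join: a gene whose RefSeq_ID lists two of the queried IDs comes back as two rows. A small sketch of that case, on made-up data (the values are hypothetical, and joined, keys and one_per_gene are illustrative names), with one possible way to get a single row per gene:

    ```r
    # Made-up input; GENE_A carries two of the queried IDs.
    csv <- data.frame(
      gene_symbol = c("GENE_A", "GENE_B"),
      RefSeq_ID   = c("NM_013433,NM_001386", "NM_000002"),
      stringsAsFactors = FALSE
    )
    ids <- c("NM_013433", "NM_001386")

    Master <- cbind(key = seq_len(nrow(csv)), csv)
    key0   <- strsplit(csv$RefSeq_ID, ",", fixed = TRUE)
    RefSeq <- data.frame(key = rep(seq_along(key0), lengths(key0)),
                         ID  = unlist(key0),
                         stringsAsFactors = FALSE)

    # The join returns one row per matching ID, so GENE_A appears twice.
    joined <- merge(Master, subset(RefSeq, ID %in% ids), by = "key")[, -1]

    # Alternative: collect the distinct keys that matched and index the
    # original rows, giving one row per gene.
    keys         <- unique(RefSeq$key[RefSeq$ID %in% ids])
    one_per_gene <- csv[keys, ]
    print(one_per_gene)
    ```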
