Merge by Range in R - Applying Loops

前端 未结 3 1060
失恋的感觉
失恋的感觉 2020-11-28 12:59

I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuc

相关标签:
3条回答
  • 2020-11-28 13:33

    The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that

    con <- textConnection("SNP     BP
    rs064   12292
    rs319   345367
    rs285   700042")
    snps <- read.delim(con, head=TRUE, sep="")
    
    con <- textConnection("Gene    BP_start    BP_end
    E613    345344      363401
    E92     694501      705408
    E49     362370      368340") ## missing trailing digit on BP_end??
    genes <- read.delim(con, head=TRUE, sep="")
    

    then create 'IRanges' out of each

    library(IRanges)
    isnps <- with(snps, IRanges(BP, width=1, names=SNP))
    igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene)
    

    (pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')

    olaps <- findOverlaps(isnps, igenes)
    

    two of the SNPs overlap

    > queryHits(olaps)
    [1] 2 3
    

    and they overlap the first and second genes

    > subjectHits(olaps)
    [1] 1 2
    

    If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as

    > cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
        SNP     BP Gene BP_start BP_end
    2 rs319 345367 E613   345344 363401
    3 rs285 700042  E92   694501 705408
    

    Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you

    library(GenomicRanges)
    isnps <-
        with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
    igenes <-
        with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))
    
    0 讨论(0)
  • 2020-11-28 13:42

    You can merge by range in base by using apply and which. To return only those which match (inner join):

    do.call(rbind, apply(file1test, 1, function(x) {
      if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
        cbind(rbind(x), file2[idx,])
      }
    }))
    #      SNP     BP  Gene BP_start BP_end
    #x  rs2343 860269 E3543   860260 879955
    #1   rs754 861822 E3543   860260 879955
    #2   rs754 861822   E11   861322 879533
    #x1  rs854 367934  E613   367640 368634
    

    or all from file1test (left join):

    do.call(rbind, apply(file1test, 1, function(x) {
      if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
        cbind(rbind(x), file2[idx,])
      } else {
        cbind(rbind(x), file2[1,][NA,])
      }
    }))
    #      SNP     BP  Gene BP_start BP_end
    #x  rs2343 860269 E3543   860260 879955
    #x1  rs211 369640  <NA>       NA     NA
    #1   rs754 861822 E3543   860260 879955
    #2   rs754 861822   E11   861322 879533
    #x2  rs854 367934  E613   367640 368634
    #x3  rs343 706940  <NA>       NA     NA
    #x4  rs626 717244  <NA>       NA     NA
    
    0 讨论(0)
  • 2020-11-28 13:45

    I believe what you're asking for is a conditional join. They're easy in SQL, and the sqldf package makes it easy to query data frames in R using SQL.

    Just pick a version depending on how you want unmatched SNPs handled.

    Inner join version:

    > sqldf("select * from file1test f1 inner join file2 f2 
    +       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")
    

    Output:

         SNP     BP  Gene BP_start BP_end
    1 rs2343 860269 E3543   860260 879955
    2  rs754 861822 E3543   860260 879955
    3  rs754 861822   E11   861322 879533
    4  rs854 367934  E613   367640 368634
    > 
    

    Left Join version:

    > sqldf("select * from file1test f1 left join file2 f2 
    +       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")
    

    Output:

         SNP     BP  Gene BP_start BP_end
    1 rs2343 860269 E3543   860260 879955
    2  rs211 369640  <NA>       NA     NA
    3  rs754 861822 E3543   860260 879955
    4  rs754 861822   E11   861322 879533
    5  rs854 367934  E613   367640 368634
    6  rs343 706940  <NA>       NA     NA
    7  rs626 717244  <NA>       NA     NA
    > 
    

    Note that you may want to be careful where you place the = if it matters which group a BP will fall in for the case where a BP exactly matches a BP_start or BP_end.

    0 讨论(0)
提交回复
热议问题