Find point-to-range overlaps

你的背包 2021-01-22 19:18

I have a dataframe df1:

df1 <- read.table(text=" Chr06  79641
Chr06   82862
Chr06   387314
Chr06   656098
Chr06   678491
Chr06   1018696", heade

3 Answers
  • You can do this using sapply:

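    # For each row of df1, test whether its position falls inside any df2 range
    # on the same chromosome. seq_len() is safe even when df1 has zero rows.
    sapply(seq_len(nrow(df1)), function(x) any(df1[x, 2] >= df2$V2 &
                                               df1[x, 2] <= df2$V3 &
                                               df1[x, 1] == df2$V1))
    # [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE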
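
The check in this answer can be tried end to end with a small reproducible sketch. `df1` holds the six points from the question. `df2` is an assumption: the question's `df2` is not shown, so only the three ranges that appear in the answers' output are used here, and the real table may contain more rows. `vapply` is swapped in for `sapply` so the result is guaranteed to be a logical vector.

```r
# df1: the six points from the question (columns V1, V2 as used in the answers).
df1 <- data.frame(
  V1 = "Chr06",
  V2 = c(79641L, 82862L, 387314L, 656098L, 678491L, 1018696L),
  stringsAsFactors = FALSE
)

# df2: ASSUMPTION. Only the three ranges visible in the answers' output;
# the full df2 from the question is not available.
df2 <- data.frame(
  V1 = "Chr06",
  V2 = c(79720L, 387314L, 1018676L),
  V3 = c(87043L, 387371L, 1018736L),
  stringsAsFactors = FALSE
)

# For each point, TRUE if any range on the same chromosome contains it
# (both range ends are treated as inclusive).
hits <- vapply(seq_len(nrow(df1)), function(i) {
  any(df1$V1[i] == df2$V1 &
      df1$V2[i] >= df2$V2 &
      df1$V2[i] <= df2$V3)
}, logical(1))

# Keep only the points that fall inside a range.
overlapping_points <- df1[hits, ]
print(overlapping_points)
```

This compares every point against every range, which is fine for small tables; for large inputs the interval-join approaches in the other answers are likely a better fit.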
    
  • 2021-01-22 20:04

    Using GenomicRanges:

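    library(GenomicRanges)

    # Convert both data frames to GRanges objects;
    # each point in df1 becomes a range whose start and end are the same position
    gr1 <- GRanges(seqnames = df1$V1,
                   ranges = IRanges(df1$V2, df1$V2))
    
    gr2 <- GRanges(seqnames = df2$V1,
                   ranges = IRanges(df2$V2, df2$V3))

    # Keep only the points in gr1 that overlap a range in gr2
    subsetByOverlaps(gr1, gr2)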
    
    # GRanges object with 3 ranges and 0 metadata columns:
    #       seqnames             ranges strand
    #          <Rle>          <IRanges>  <Rle>
    #  [1]    Chr06 [  82862,   82862]      *
    #  [2]    Chr06 [ 387314,  387314]      *
    #  [3]    Chr06 [1018696, 1018696]      *
    #   -------
    #   seqinfo: 1 sequence from an unspecified genome; no seqlengths
    
    # Or use mergeByOverlaps to see which range each point falls in
    mergeByOverlaps(gr1, gr2)
    
    # DataFrame with 3 rows and 2 columns
    #                          gr1                        gr2
    #                    <GRanges>                  <GRanges>
    # 1 Chr06:*:[  82862,   82862] Chr06:*:[  79720,   87043]
    # 2 Chr06:*:[ 387314,  387314] Chr06:*:[ 387314,  387371]
    # 3 Chr06:*:[1018696, 1018696] Chr06:*:[1018676, 1018736]
    

    Also, look into bedtools:

    Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
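
To hand the same data to bedtools, both tables first have to be written as BED files. The following is a minimal sketch, not a tested pipeline: the file names `points.bed` and `ranges.bed` are hypothetical, the three ranges are the ones visible in the output above (the question's full `df2` is not shown), and the input coordinates are assumed to be 1-based and inclusive. BED uses 0-based starts with exclusive ends, so each start is shifted down by one.

```r
# Points and ranges, ASSUMED to be 1-based inclusive coordinates.
# Integer columns avoid scientific notation in the written files.
points <- data.frame(chr = "Chr06",
                     pos = c(79641L, 82862L, 387314L, 656098L, 678491L, 1018696L),
                     stringsAsFactors = FALSE)
ranges <- data.frame(chr = "Chr06",
                     start = c(79720L, 387314L, 1018676L),
                     end   = c(87043L, 387371L, 1018736L),
                     stringsAsFactors = FALSE)

# BED is 0-based, half-open: subtract 1 from each start, keep the end as is.
points_bed <- data.frame(chr = points$chr,
                         start = points$pos - 1L,
                         end = points$pos)
ranges_bed <- data.frame(chr = ranges$chr,
                         start = ranges$start - 1L,
                         end = ranges$end)

# Write tab-separated files without header, row names, or quotes.
write_bed <- function(x, path) {
  write.table(x, path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}
points_path <- file.path(tempdir(), "points.bed")  # hypothetical file name
ranges_path <- file.path(tempdir(), "ranges.bed")  # hypothetical file name
write_bed(points_bed, points_path)
write_bed(ranges_bed, ranges_path)

# Then, on the command line, report each point that overlaps any range once:
#   bedtools intersect -u -a points.bed -b ranges.bed
```

The `-u` flag of `bedtools intersect` writes each entry of the `-a` file once if it overlaps anything in the `-b` file, which corresponds to the `subsetByOverlaps` result.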

  • 2021-01-22 20:18

    Here is a data.table solution as an alternative to GenomicRanges:

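    library(data.table)
    # Turn each point into an interval whose start and end are both V2
    dt1 <- data.table(df1)[, V3 := V2]
    # Key dt2 on its interval columns so foverlaps knows the start and end
    dt2 <- data.table(df2, key = c("V2", "V3"))
    # Join on overlap, keep same-chromosome matches, then drop columns 4 and 6
    foverlaps(dt1, dt2)[V1 == i.V1][, -c(4, 6), with = FALSE]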
    #       V1      V2      V3    i.V3
    # 1: Chr06   79720   87043   82862
    # 2: Chr06  387314  387371  387314
    # 3: Chr06 1018676 1018736 1018696
    