Common genomic intervals in R

£可爱£侵袭症+ submitted on 2019-12-04 14:21:48

If your data is in a data.frame (see below), I would use the Bioconductor GenomicRanges package to create a GRanges instance, keeping the non-range columns too:

library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)

The discrete ranges represented by the data are given by the disjoin function, and the overlaps between the disjoint ranges ('query') and your original ranges ('subject') are given by findOverlaps:

d <- disjoin(gr)
olaps <- findOverlaps(d, gr)

Split the sample information associated with each overlapping subject with the corresponding query, and associate it with the disjoint GRanges as

mcols(d) <- splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps))

leading to, for instance:

> d[elementNROWS(d$value) > 1]
GRanges with 2 ranges and 1 metadata column:
      seqnames     ranges strand |           value
         <Rle>  <IRanges>  <Rle> | <CharacterList>
  [1]        1 [100, 150]      * |     NE001,NE002
  [2]        2 [100, 150]      * |     NE001,NE002
  ---
  seqlengths:
    1  2
   NA NA
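
The disjoin-then-overlap idea can also be sketched in base R for readers who want to see what is being computed. This is only an illustration, assuming closed integer intervals; disjoint_pieces is a hypothetical helper name, not a GenomicRanges function, and it will be far slower than disjoin() and findOverlaps() on real data.

```r
# Same example data as in the answer.
df <- read.table(text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: cut each chromosome into disjoint pieces and
# record which samples cover each piece.
disjoint_pieces <- function(df) {
  per_chr <- lapply(split(df, df$chr), function(x) {
    # Piece boundaries: every start, and the position just past every end.
    breaks <- sort(unique(c(x$start, x$end + 1)))
    pieces <- data.frame(chr   = x$chr[1],
                         start = head(breaks, -1),
                         end   = tail(breaks, -1) - 1)
    # Each piece is either fully inside or fully outside any interval.
    covering <- lapply(seq_len(nrow(pieces)), function(i)
      x$sample[x$start <= pieces$start[i] & x$end >= pieces$end[i]])
    pieces$samples <- vapply(covering, paste, character(1), collapse = ",")
    pieces$n <- lengths(covering)
    # Drop gaps that no interval covers.
    pieces[pieces$n > 0, ]
  })
  do.call(rbind, per_chr)
}

pieces <- disjoint_pieces(df)
shared <- pieces[pieces$n > 1, ]  # pieces covered by more than one sample
print(shared)
```

For the example data, shared contains the 100 to 150 piece on chromosomes 1 and 2, matching the GRanges result above.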

Here's how I input your data:

txt <- "sample    chr start end
NE001      1   100  200
NE001      2   100  200
NE002      1   50   150
NE002      2   50   150
NE003      2   250  300"
df <- read.table(textConnection(txt), header=TRUE, stringsAsFactors=FALSE)

Given the context behind this question, I suspect it will be worth your while to learn the GenomicRanges package from Bioconductor.

library(GenomicRanges)
gr <- GRanges(seqnames=df$chr, ranges=IRanges(start=df$start, end=df$end))
ov <- findOverlaps(gr,gr, type="any")
ov <- ov[queryHits(ov) != subjectHits(ov)]
between <- pintersect(gr[subjectHits(ov)], gr[queryHits(ov)])

The approach is: find all self-overlaps, remove the trivial ones where an interval is compared with itself (4th line), and then find the intersection of each remaining pair of intervals. Each overlapping pair appears twice in ov, once in each order. You can then tabulate the results however you wish.
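
As a rough illustration of what the overlap and intersection steps compute, here is a base-R sketch under the assumption of closed integer intervals. It is not how GenomicRanges works internally; the names idx, overlaps and common are made up for this example, and each pair is visited only once.

```r
# Same example data as in the question.
df <- read.table(text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300", header = TRUE, stringsAsFactors = FALSE)

# All unordered pairs of rows, so each pair is considered once.
idx <- combn(nrow(df), 2)
i <- idx[1, ]
j <- idx[2, ]

# Two closed intervals on the same chromosome overlap when each one
# starts no later than the other ends.
overlaps <- df$chr[i] == df$chr[j] &
  df$start[i] <= df$end[j] &
  df$start[j] <= df$end[i]
i <- i[overlaps]
j <- j[overlaps]

# The intersection runs from the later start to the earlier end.
common <- data.frame(chr     = df$chr[i],
                     start   = pmax(df$start[i], df$start[j]),
                     end     = pmin(df$end[i], df$end[j]),
                     sample1 = df$sample[i],
                     sample2 = df$sample[j],
                     stringsAsFactors = FALSE)
print(common)
```

Checking every pair is quadratic in the number of intervals, so this is only suitable for small inputs.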

This is certainly very long (and likely very inefficient on large data.frames, given the expand.grid.df cross join); however, I hope it gives you a starting point. As a caveat, I have no background in genomics (which I'm sure comes through), so I had no idea of the common packages for this. Surely those are the best way to go; I just thought it would be fun to attempt a solution.

s<-"sample    chr start end
NE001      1   100  200
NE001      2   100  200
NE002      1   50   150
NE002      2   50   150
NE003      2   250  300"

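dat <- read.table(text=s, header=TRUE)

library(plyr)
# TRUE where y lies in the closed interval [x, z]
between <- function(x, y, z) x <= y & y <= z
dat$id <- seq_len(nrow(dat))
# Cross join of data.frames: merge with by=NULL gives every row combination
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
# Pair every interval with every interval on the same chromosome
expdat <- ddply(dat, .(chr), function(x) expand.grid.df(x, x))
# Drop pairs where an interval is matched with itself
expdat <- subset(expdat, id.x != id.y)
expdat$betweenL <- with(expdat, between(start.y, start.x, end.y))
expdat$betweenR <- with(expdat, between(start.x, start.y, end.x))
expdat <- subset(expdat, betweenL | betweenR)
# The common region runs from the later start to the earlier end
# (this also handles one interval lying entirely inside the other)
expdat$commonstart <- with(expdat, pmax(start.x, start.y))
expdat$commonend <- with(expdat, pmin(end.x, end.y))
res <- ddply(expdat, .(chr, commonstart, commonend), summarize, freq=length(sample.x))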
> res
  chr commonstart commonend freq
1   1         100       150    2
2   2         100       150    2