I would like to infer the genomic intervals shared between different samples.
My input:
sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300
My expected output:
chr start end freq
1 100 150 2
2 100 150 2
Here "freq" is the number of samples that contributed to the shared region. In the above example freq = 2 (NE001 and NE002).
Cheers!
If your data is in a data.frame (see below), you can use the Bioconductor GenomicRanges package to create a GRanges instance, keeping the non-range columns too:
library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)
The discrete ranges represented by the data are given by the disjoin function, and the overlaps between the disjoint ranges ('query') and your original ranges ('subject') are found with findOverlaps:
d <- disjoin(gr)
olaps <- findOverlaps(d, gr)
Split the sample information associated with each overlapping subject by the corresponding query, and attach it to the disjoint GRanges as a metadata column:
d$value <- splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps))
leading to, for instance,
> d[elementNROWS(d$value) > 1]
GRanges with 2 ranges and 1 metadata column:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <CharacterList>
[1] 1 [100, 150] * | NE001,NE002
[2] 2 [100, 150] * | NE001,NE002
---
seqlengths:
1 2
NA NA
Here's how I input your data:
txt <- "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300"
df <- read.table(textConnection(txt), header=TRUE, stringsAsFactors=FALSE)
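To get from the disjoint ranges to the table asked for in the question, one option is to count the distinct samples per disjoint range and keep only the ranges seen in more than one sample. This is a minimal sketch rather than the only way to do it; the names `samples`, `res` and `freq` are my own, and the toy data is repeated so the block runs by itself.

```r
library(GenomicRanges)

# Same toy data as in the question
df <- read.table(text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300", header = TRUE, stringsAsFactors = FALSE)

gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
d <- disjoin(gr)
olaps <- findOverlaps(d, gr)

# Distinct samples covering each disjoint range (one list element per range in d)
samples <- unique(splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps)))

res <- data.frame(chr = as.character(seqnames(d)),
                  start = start(d),
                  end = end(d),
                  freq = unname(lengths(samples)))

# Keep only the regions shared by at least two samples
res <- res[res$freq > 1, ]
rownames(res) <- NULL
res
```

Counting distinct samples rather than overlapping ranges keeps freq correct even if one sample contributes several ranges to the same region.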
Given the context behind this question, I suspect it will be worthwhile for you to learn the GenomicRanges package from Bioconductor.
library(GenomicRanges)
gr <- GRanges(seqnames=df$chr, ranges=IRanges(start=df$start, end=df$end))
ov <- findOverlaps(gr,gr, type="any")
ov <- ov[queryHits(ov) != subjectHits(ov)]
between <- pintersect(gr[subjectHits(ov)], gr[queryHits(ov)])
The approach is: find all self-overlaps, remove the trivial ones where an interval is compared to itself (4th line), and then find the intersection between each pair of remaining intervals. You can then tabulate the results however you wish.
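As one possible way to tabulate, the sketch below drops the duplicated pairs and counts how many of the original intervals contain each shared region. The names `shared`, `freq` and `out` are my own, and it assumes each sample has at most one interval per region, so counting intervals is the same as counting samples.

```r
library(GenomicRanges)

# Same toy data as in the question
df <- read.table(text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300", header = TRUE, stringsAsFactors = FALSE)

gr <- GRanges(seqnames = df$chr, ranges = IRanges(start = df$start, end = df$end))
ov <- findOverlaps(gr, gr, type = "any")
ov <- ov[queryHits(ov) != subjectHits(ov)]
between <- pintersect(gr[subjectHits(ov)], gr[queryHits(ov)])

# Each pair appears twice, as (a, b) and (b, a), so drop the duplicate ranges
shared <- unique(between)

# Number of original intervals that fully contain each shared region
shared$freq <- countOverlaps(shared, gr, type = "within")

out <- as.data.frame(shared)[, c("seqnames", "start", "end", "freq")]
out
```

With three or more overlapping samples the pairwise intersections can themselves overlap, so this table may need further merging depending on what you want to report.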
This is certainly very long, and likely very inefficient on large data.frames given the expand.grid.df step; however, I hope it gives you a starting point. As a caveat, I have no background in genomics (which I'm sure comes through), so I had no idea of the common packages for this. Surely those are the best way to go. I just thought it would be fun to attempt a solution.
s<-"sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300"
dat<-read.table(text=s, header=TRUE)
library(plyr)
between<-function(x,y,z) x<=y & y<=z
dat$id<-seq_along(dat[,1])
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expdat<-ddply(dat, .(chr), function(x) expand.grid.df(x,x))
expdat<-subset(expdat, id.x!=id.y)
expdat$betweenL<-with(expdat, between(start.y, start.x, end.y))
expdat$betweenR<-with(expdat, between(start.x, start.y, end.x))
expdat<-subset(expdat, betweenL | betweenR)
expdat$commonstart<-with(expdat, pmax(start.x, start.y))
expdat$commonend<-with(expdat, pmin(end.x, end.y))
res<-ddply(expdat, .(chr, commonstart, commonend), summarize, freq=length(unique(sample.x)))
> res
chr commonstart commonend freq
1 1 100 150 2
2 2 100 150 2
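If the pairwise expansion becomes too slow, a different technique that avoids it is to cut each chromosome at every interval boundary and count the samples covering each piece. The sketch below does that in base R only; `shared_intervals` is a hypothetical helper name of mine, and it assumes inclusive integer coordinates as in the question.

```r
dat <- read.table(text = "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300", header = TRUE, stringsAsFactors = FALSE)

shared_intervals <- function(dat) {
  pieces <- lapply(split(dat, dat$chr), function(x) {
    # Breakpoints: every start, and the position just past every end
    brk <- sort(unique(c(x$start, x$end + 1)))
    seg_start <- head(brk, -1)
    seg_end <- tail(brk, -1) - 1
    # Distinct samples whose interval fully covers each segment
    freq <- vapply(seq_along(seg_start), function(i) {
      length(unique(x$sample[x$start <= seg_start[i] & x$end >= seg_end[i]]))
    }, integer(1))
    data.frame(chr = x$chr[1], start = seg_start, end = seg_end, freq = freq)
  })
  res <- do.call(rbind, pieces)
  # Keep only segments shared by at least two samples
  res <- res[res$freq > 1, ]
  rownames(res) <- NULL
  res
}

res <- shared_intervals(dat)
res
```

Adjacent segments with the same freq are not merged here, so a long shared region may be reported in several pieces.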
Source: https://stackoverflow.com/questions/23084322/common-genomic-intervals-in-r