Collapse intersecting regions

后端未结

关注

 2  896

I am trying to find a way to collapse rows with intersecting ranges, denoted by \"start\" and \"stop\" columns, and record the collapsed values into new columns. For example

相关标签:

2条回答

感情败类

2020-12-03 02:37

IRanges is a good candidate for such job. No need to use chrom variable.

ir <- IRanges(my.df$start, my.df$stop)
## I create a new grouping variable Note the use of reduce here(performance issue)
my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
# chrom name    start     stop group2
# 1     1    a    70001    71200      2
# 2     1    b    70203    80001      2
# 3     1    c    70060    71051      2
# 4    14    d    40004    42004      1
# 5    16    e 50000872 50000890      3
# 6    16    f 50000872 51000952      3

The new group2 variable is the range indicator. Now using data.table I can't transform my data to the desired output:

library(data.table)
DT <- as.data.table(my.df)
DT[, list(start=min(start),stop=max(stop),
         name=list(name),chrom=unique(chrom)),
               by=group2]

# group2    start     stop  name chrom
# 1:      2    70001    80001 a,b,c     1
# 2:      1    40004    42004     d    14
# 3:      3 50000872 51000952   e,f    16

PS: the collapsed variable name here is not string but a list of factor. This is more efficient and easier to access than a collapased character using paste for example.

EDIT after OP clarification, I will create the group variable by chrom. I mean the Iranges code now is called for each chrom group. I slightly modify your data, to create group of intervals the same chromosome.

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), 
                   name=c("a","b","c","d","e","f","g"),
                   start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), 
                   stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))

library(data.table)
DT <- as.data.table(my.df)

## find interval for each chromsom
DT[,group := { 
      ir <-  IRanges(start, stop);
       subjectHits(findOverlaps(ir, reduce(ir)))
      },by=chrom]

## Now I group by group and chrom 
DT[, list(start=min(start),stop=max(stop),name=list(name),chrom=unique(chrom)),
   by=list(group,chrom)]

  group chrom    start     stop name chrom
1:     1     1        0        1    a     1
2:     2     1     3000     5000    b     1
3:     3     1    70060    80001  c,d     1
4:     1    14    40004    42004    e    14
5:     1    16 50000872 51000952  f,g    16

0 讨论(0)

不知归路

2020-12-03 02:53

After sorting the data, you can easily test if an interval overlaps the previous one, and assign a label to each set of overlapping intervals. Once you have those labels, you can use ddply to aggregate the data.

d <- data.frame(
  chrom = c(1,1,1,14,16,16), 
  name  = c("a","b","c","d","e","f"), 
  start = as.numeric(c(70001,70203,70060, 40004, 50000872, 50000872)), 
  stop  = as.numeric(c(71200,80001,71051, 42004, 50000890, 51000952))
)

# Make sure the data is sorted
d <- d[ order(d$start), ]

# Check if a record should be linked with the previous
d$previous_stop <- c(NA, d$stop[-nrow(d)])
d$previous_stop <- cummax(ifelse(is.na(d$previous_stop),0,d$previous_stop))
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop

# The number of the current group of records is the number of times we have switched to a new group
d$group <- cumsum( d$new_group )

# We can now aggregate the data
library(plyr)
ddply( 
  d, "group", summarize, 
  start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
#   group    start     stop    name
# 1     1        0    80001 a,d,c,b
# 2     2 50000872 51000952     e,f

But this ignores the chrom column: to account for it, you can do the same thing for each chromosome, separately.

d <- d[ order(d$chrom, d$start), ]
d <- ddply( d, "chrom", function(u) { 
  x <- c(NA, u$stop[-nrow(u)])
  y <- ifelse( is.na(x), 0, x )
  y <- cummax(y)
  y[ is.na(x) ] <- NA
  u$previous_stop <- y
  u
} )
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop
d$group <- cumsum( d$new_group )
ddply( 
  d, .(chrom,group), summarize, 
  start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
#   chrom group    start     stop  name
# 1     1     1        0    80001 a,c,b
# 2    14     2    40004    42004     d
# 3    16     3 50000872 51000952   e,f

0 讨论(0)