Question
I mapped my reads to my assembly with the bwa mem algorithm and extracted the number of reads per base (= coverage) with samtools depth. The resulting file looks like this:
1091900001 1 236
1091900001 2 245
1091900001 3 265
1091900001 4 283
1091900001 5 288
1091900002 1 297
1091900002 2 312
1091900002 3 327
1091900002 4 338
1091900002 5 348
It has three columns: the name of the contig (since it is a multi-contig assembly, this ID changes), the position (base), and the number of reads that mapped at that position (coverage).
Now I want to calculate the mean coverage (third column) in sliding windows of size 3 with a step of 2, separately for each contig (first column).
I want to use the rollapply function of the zoo package.
require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")
But this of course does not take the contig into account. Also, my expected output should include the contig ID and the window over which the mean was calculated:
1091900001 1 3 248.6667
1091900001 3 5 278.6667
1091900002 1 3 312.0000
1091900002 3 5 337.6667
Is there an easy way to do that in R?
Answer 1:
Here's how you can do this with the dplyr functions group_by and do:
library(dplyr)
cov %>%
  group_by(Chr) %>%
  do(
    data.frame(
      window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
    )
  )
# # A tibble: 4 x 4
# # Groups: Chr [2]
# Chr window.start window.end coverage
# <int> <int> <int> <dbl>
# 1 1091900001 1 3 248.6667
# 2 1091900001 3 5 278.6667
# 3 1091900002 1 3 312.0000
# 4 1091900002 3 5 337.6667
do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return the rolling mean of the coverage values, along with the min and max values of locus for each window.
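As a side note, do() is considered superseded in newer dplyr releases. A minimal sketch of the same computation with group_modify() (assuming a dplyr version that provides it, and that zoo is loaded for rollapply) could look like this:
library(dplyr)
library(zoo)

# helper returning one row per window for a single contig's rows
win_stats <- function(d, ...) {
  data.frame(
    window.start = rollapply(d$locus, width = 3, by = 2, FUN = min, align = "left"),
    window.end   = rollapply(d$locus, width = 3, by = 2, FUN = max, align = "left"),
    coverage     = rollapply(d$depth, width = 3, by = 2, FUN = mean, align = "left")
  )
}

cov %>%
  group_by(Chr) %>%
  group_modify(win_stats) %>%   # applies win_stats to each contig, keeps Chr as a column
  ungroup()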
Edit:
If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:
library(data.table)
setDT(cov)
cov[, .(
  window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
  window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
  coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
),
.(Chr)]
Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):
# dplyr:
min lq mean median uq max neval
7.811753 8.685976 10.10268 9.243551 10.42691 144.5274 1000
# data.table:
min lq mean median uq max neval
1.924472 2.105459 2.510832 2.30479 2.685706 8.848451 1000
So on the sample data, the data.table option is about 4x faster on average.
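For reference, timings like these can be produced with the microbenchmark package. A rough sketch under that assumption (the cov_df / cov_dt copies and times = 1000 are illustrative choices, and the exact numbers will vary by machine and data size):
library(microbenchmark)
library(dplyr)
library(data.table)
library(zoo)

cov_df <- as.data.frame(cov)   # plain data.frame copy for the dplyr pipeline
cov_dt <- as.data.table(cov)   # data.table copy for the data.table call

microbenchmark(
  dplyr = cov_df %>%
    group_by(Chr) %>%
    do(data.frame(
      window.start = rollapply(.$locus, width = 3, by = 2, FUN = min, align = "left"),
      window.end   = rollapply(.$locus, width = 3, by = 2, FUN = max, align = "left"),
      coverage     = rollapply(.$depth, width = 3, by = 2, FUN = mean, align = "left")
    )),
  data.table = cov_dt[, .(
    window.start = rollapply(locus, width = 3, by = 2, FUN = min, align = "left"),
    window.end   = rollapply(locus, width = 3, by = 2, FUN = max, align = "left"),
    coverage     = rollapply(depth, width = 3, by = 2, FUN = mean, align = "left")
  ), .(Chr)],
  times = 1000
)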
Source: https://stackoverflow.com/questions/48721332/genome-coverage-as-sliding-window