I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k
If you are looking for speed the obvious solution is stringi package.
There is stri_count_fixed
function for counting patterns.
And now, check the code and benchmark!
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
fstri <- function(x){
t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)}))
}
all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE
longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
expr min lq mean median uq max neval
fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100
fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100
127245.489/766.862
## [1] 165.9301
Ca 165x times faster :)
My answer wasn't as fast as @bartektartanus. However, it is also pretty fast and I wrote the code... :D
stri_count_fixed
stringi
package will get really slow for big k-mers since it
has to generate all possible combinations for pattern and afterwards,
check their existence in the data and count how many times it appears.k
instead of creating a pattern string. oligonucleotideFrequency
with a k
bigger than 12
in a big sequence, the function freezes for excess of memory use and
R is restarted, while with my function it runs pretty fast.sequence_kmers <- function(sequence, k){
k_mers <- lapply(sequence,function(x){
seq_loop_size <- length(DNAString(x))-k+1
kmers <- sapply(1:seq_loop_size, function(z){
y <- z + k -1
kmer <- substr(x=x, start=z, stop=y)
return(kmer)
})
return(kmers)
})
uniq <- unique(unlist(k_mers))
ind <- t(sapply(k_mers, function(x){
tabulate(match(x, uniq), length(uniq))
}))
colnames(ind) <- uniq
return(ind)
}
I use the Biostrings
package only to count the bases... you can use other options like stringi
to count...
if you remove all code below k_mers lapply
and return(k_mers)
it returns just the list... of all k-mers with the respective repeated vectors
sequence
here is a sequence of 1000bp#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
4 4 3 4 4 8 6 4 5 5
> sequence_kmers(c(sequence,sequence),4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,] 4 4 3 4 4 8 6 4 5 5
[2,] 4 4 3 4 4 8 6 4 5 5
Tests done with my function:
#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
usuário sistema decorrido
0.08 0.00 0.08
#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
user system elapsed
3.61 0.00 3.61
#same speed for 3-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
user system elapsed
3.63 0.00 3.62
Tests done with Biostrings
:
#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
user system elapsed
150.11 1.14 151.37
#Biostrings package freezes for a single sequence of 13-mers
> system.time({oligonucleotideFrequency(sequence,13)})
freezes, used all my 8gb RAM
May be this helps
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
t(sapply(DNAlst, function(x){x1 <- DNAString(x)
oligonucleotideFrequency(x1,2)}))
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
Or as suggested by @Arun, convert the list
to vector
first
oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
#[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
#[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
#[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
#[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
Another way to do this:
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA")
len <- 4
stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len))
sapply(DNAlst, stri_sub_fun)
[[1]]
AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT
1 1 1 1 1 1 1 1 1
[[2]]
AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[[3]]
ATGC ATTA TATG TTAT
1 1 1 1
[[4]]
TGGA
1
[[5]]
ATCA CATC CGCA CGCG GCAT GCGC TCAA
1 1 1 1 1 1 1
[[6]]
ACAC ACCA CACA CACC
4 1 3 1
We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality for computing k-mer frequencies efficiently, too:
#installing kebabs:
#source("http://bioconductor.org/biocLite.R")
#biocLite(c("kebabs", "Biostrings"))
library(kebabs)
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s1, 200), collate=""))
s2
sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
So you see that the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). You see that even for k=13 and a sequence with tens of thousands of bases, the computations only take fractions of a second (19 msecs on our 5-year-old Dell server). The above function also works for DNAStringSets, but, in this case, you should remove the drop() to get a matrix of k-mer frequencies. The matrix is by default sparse (class 'dgRMatrix'), but you can also enforce the result to be in standard dense matrix format (however, still omitting k-mers that do not occur at all in any of the sequences):
sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
How long the k-mers may be, may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are significantly lower, since the feature space is obviously much larger for the same k.
#for the kebabs documentation please see:
browseVignettes("kebabs")
I hope that helps. If you have any further questions, please let me know.
Best regards, Ulrich