I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k
My answer wasn't as fast as @bartektartanus. However, it is also pretty fast and I wrote the code... :D
stri_count_fixed
stringi
package will get really slow for big k-mers since it
has to generate all possible combinations for pattern and afterwards,
check their existence in the data and count how many times it appears.k
instead of creating a pattern string. oligonucleotideFrequency
with a k
bigger than 12
in a big sequence, the function freezes for excess of memory use and
R is restarted, while with my function it runs pretty fast.sequence_kmers <- function(sequence, k){
k_mers <- lapply(sequence,function(x){
seq_loop_size <- length(DNAString(x))-k+1
kmers <- sapply(1:seq_loop_size, function(z){
y <- z + k -1
kmer <- substr(x=x, start=z, stop=y)
return(kmer)
})
return(kmers)
})
uniq <- unique(unlist(k_mers))
ind <- t(sapply(k_mers, function(x){
tabulate(match(x, uniq), length(uniq))
}))
colnames(ind) <- uniq
return(ind)
}
I use the Biostrings
package only to count the bases... you can use other options like stringi
to count...
if you remove all code below k_mers lapply
and return(k_mers)
it returns just the list... of all k-mers with the respective repeated vectors
sequence
here is a sequence of 1000bp#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
4 4 3 4 4 8 6 4 5 5
> sequence_kmers(c(sequence,sequence),4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,] 4 4 3 4 4 8 6 4 5 5
[2,] 4 4 3 4 4 8 6 4 5 5
Tests done with my function:
#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
usuário sistema decorrido
0.08 0.00 0.08
#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
user system elapsed
3.61 0.00 3.61
#same speed for 3-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
user system elapsed
3.63 0.00 3.62
Tests done with Biostrings
:
#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
user system elapsed
150.11 1.14 151.37
#Biostrings package freezes for a single sequence of 13-mers
> system.time({oligonucleotideFrequency(sequence,13)})
freezes, used all my 8gb RAM