Matching and counting strings (k-mers of DNA) in R

深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequences) made up of A, T, C, and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k

5 answers

梦如初夏 (original poster)

2021-02-06 12:38
We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality for computing k-mer frequencies efficiently, too:
```
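# Installing kebabs (current Bioconductor releases use BiocManager;
# the biocLite() script from the original answer has been retired):
# install.packages("BiocManager")
# BiocManager::install(c("kebabs", "Biostrings"))
library(kebabs)

# A 50-base example sequence
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
# The same sequence repeated 200 times (10,000 bases);
# note the paste0() argument is collapse, not collate
s2 <- DNAString(paste0(rep(as.character(s1), 200), collapse=""))
s2

# Unnormalized spectrum kernel with k = 13
sk13 <- spectrumKernel(k=13, normalized=FALSE)
# The explicit representation holds the k-mer counts;
# drop() reduces the one-row result to a vector
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq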
```
As you can see, the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). Even for k=13 and a sequence of 10,000 bases (s2 above), the computation takes only a fraction of a second (19 ms on our 5-year-old Dell server).

The above function also works for DNAStringSets; in that case, remove the drop() to get a matrix of k-mer frequencies. The matrix is sparse by default (class 'dgRMatrix'), but you can also force the result into standard dense matrix format (still omitting k-mers that do not occur in any of the sequences):
```
sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
```
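To make concrete what these frequency vectors and matrices contain, here is a minimal base-R sketch that counts only the k-mers that actually occur and collects them into a dense matrix with one row per sequence. It is an illustration only, not how kebabs computes them (kebabs uses a prefix tree in C++), and it will be far slower on long sequences. The helper names `count_kmers` and `kmer_matrix` are made up for this example.

```r
# Count all overlapping k-mers in one sequence given as a plain string.
# Returns a named integer vector containing only k-mers that occur.
count_kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  kmers <- substring(s, starts, starts + k - 1)
  counts <- table(kmers)
  setNames(as.integer(counts), names(counts))
}

# Build a dense count matrix: one row per sequence, one column per
# k-mer that occurs in at least one of the sequences.
kmer_matrix <- function(seqs, k) {
  per_seq <- lapply(seqs, count_kmers, k = k)
  all_kmers <- sort(unique(unlist(lapply(per_seq, names))))
  m <- matrix(0L, nrow = length(seqs), ncol = length(all_kmers),
              dimnames = list(names(seqs), all_kmers))
  for (i in seq_along(per_seq)) {
    if (length(per_seq[[i]]) > 0) {
      m[i, names(per_seq[[i]])] <- per_seq[[i]]
    }
  }
  m
}

# Hypothetical toy input, not taken from the question
seqs <- c(a = "ATCGATCG", b = "GATTACA")
m <- kmer_matrix(seqs, k = 3)
m
```

If you need a column for every one of the 4^k possible k-mers, as the question describes, `oligonucleotideFrequency(x, width = k)` from Biostrings returns counts for all of them, which is only practical for small k.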
The maximum k may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same approach works for RNA and amino acid sequences. For the latter, however, the limit on k is significantly lower, since the feature space is much larger for the same k.
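The difference in feature-space size is easy to check: an alphabet of size a has a^k possible k-mers. A quick back-of-the-envelope calculation in R, assuming the 4-letter DNA alphabet and the standard 20-letter amino acid alphabet (the helper name `n_kmers` is made up for this example):

```r
# Number of possible k-mers over an alphabet of the given size
n_kmers <- function(alphabet_size, k) alphabet_size^k

dna13 <- n_kmers(4, 13)   # DNA, k = 13: 67108864
dna22 <- n_kmers(4, 22)   # DNA, k = 22: about 1.76e13
aa10  <- n_kmers(20, 10)  # amino acids, k = 10: about 1.02e13
c(dna13 = dna13, dna22 = dna22, aa10 = aa10)
```

So for amino acids, k=10 already gives a feature space in the same range as k=22 for DNA.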
```
#for the kebabs documentation please see:
browseVignettes("kebabs")
```
I hope that helps. If you have any further questions, please let me know.

Best regards, Ulrich