matching and counting strings (k-mer of DNA) in R


I have a list of strings (DNA sequences) made up of the letters A, T, C, and G. I want to find all matches and insert them into a table whose columns are all possible combinations of that DNA alphabet (4^k

The advantages of my code over the others are:

You don't need to install the unimplemented version of stri_count_fixed.
The stringi approach will probably get really slow for large k, since it has to generate every possible combination as a pattern and then check each one against the data and count how many times it appears.
It works on a single long sequence and on multiple sequences, with the same output format, and it is really fast.
You pass a value for k instead of building a pattern string.
If you run oligonucleotideFrequency with k bigger than 12 on a big sequence, the function freezes from excessive memory use and R restarts, while my function runs pretty fast.

My code

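library(Biostrings)  # only needed for DNAString(), used to get the sequence length

sequence_kmers <- function(sequence, k){
    # list the k-mers of every sequence with a sliding window of width k
    k_mers <- lapply(sequence,function(x){
        seq_loop_size <- length(DNAString(x))-k+1

        # seq_len() avoids the 1:0 trap when a sequence is shorter than k
        kmers <- sapply(seq_len(max(seq_loop_size, 0)), function(z){
            y <- z + k -1
            kmer <- substr(x=x, start=z, stop=y)
            return(kmer)
        })
        return(kmers)
    })

    # count, per sequence, only the k-mers that actually occur
    uniq <- unique(unlist(k_mers))
    ind <- t(sapply(k_mers, function(x){
        tabulate(match(x, uniq), length(uniq))
    }))
    colnames(ind) <- uniq

    return(ind)
}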

I use the Biostrings package only to count the bases; you could use other options such as stringi for that. If you remove all the code below the k_mers lapply and return(k_mers) instead, the function returns just the list of all k-mers for each sequence, repeats included.
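As an illustration of that last remark, here is a minimal sketch of the "just the list of k-mers" variant that needs no package at all. `kmers_base` is a hypothetical name, not part of the answer's code, and it assumes plain character strings as input; `nchar()` stands in for the Biostrings length call.

```r
# Hypothetical helper (not from the original answer): list the k-mers of
# each sequence using base R only.
kmers_base <- function(sequences, k) {
  lapply(sequences, function(x) {
    n <- nchar(x) - k + 1            # number of sliding windows
    if (n < 1) return(character(0))  # sequence shorter than k
    starts <- seq_len(n)
    substring(x, starts, starts + k - 1)  # vectorised over all window starts
  })
}

res <- kmers_base(c("ACGTAC", "AC"), 3)
res
```

`substring()` is vectorised over its start and stop arguments, so the inner loop over positions is not needed in this variant.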

`sequence` here is a sequence of 1000bp

#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 
   4    4    3    4    4    8    6    4    5    5 
> sequence_kmers(c(sequence,sequence),4)[,1:10]
     GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,]    4    4    3    4    4    8    6    4    5    5
[2,]    4    4    3    4    4    8    6    4    5    5

Tests done with my function:

#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
     user    system   elapsed 
     0.08      0.00      0.08 

#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
     user    system   elapsed
     3.61      0.00      3.61 

#same speed for 4-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
     user    system   elapsed
     3.63      0.00      3.62

Tests done with Biostrings:

#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
     user    system   elapsed 
   150.11      1.14    151.37 

#Biostrings freezes on 13-mers for a single sequence
> system.time({oligonucleotideFrequency(DNAString(sequence),13)})
freezes, used all my 8 GB of RAM
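A rough back-of-the-envelope check of why large k hurts any approach that keeps one slot per possible k-mer. This is only arithmetic under a stated assumption (one 4-byte integer count per k-mer, ignoring names and temporary copies), so treat it as a lower bound and not as a measurement of what Biostrings actually allocates.

```r
k <- c(4, 8, 12, 13)
n_kmers <- 4^k                 # number of possible DNA k-mers
mib <- n_kmers * 4 / 2^20      # assumption: one 4-byte integer per k-mer

data.frame(k = k, n_kmers = n_kmers, MiB = mib)
```

For k = 13 that is already 67,108,864 slots (256 MiB) for the counts alone, whereas a 1000 bp sequence contains at most 988 distinct 13-mers.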


梦谈多话

2021-02-06 12:34

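Maybe this helps:

 # biocLite() has been superseded by BiocManager on current Bioconductor releases
 if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
 BiocManager::install("Biostrings")
 library(Biostrings)
 # DNAlst is a list of DNA strings, one per sequence
 t(sapply(DNAlst, function(x){x1 <-  DNAString(x)
                   oligonucleotideFrequency(x1,2)}))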
  #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
  #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
  #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
  #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
  #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
  #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

Or, as suggested by @Arun, convert the list to a vector first:

   oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
   #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
   #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
   #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
   #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
   #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
   #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
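For readers without Bioconductor, the same "one column per possible k-mer" layout can be sketched in base R. `all_kmers` is a hypothetical helper introduced only for this illustration; it is only practical for small k, since it materialises all 4^k names.

```r
# Hypothetical helper: all 4^k k-mers in alphabetical order.
all_kmers <- function(k, alphabet = c("A", "C", "G", "T")) {
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  # expand.grid varies its first column fastest, so paste the columns in
  # reverse to make the last letter the fastest-changing one
  do.call(paste0, rev(grid))
}

x <- "CAAACTGATTTT"
k <- 2
km <- substring(x, 1:(nchar(x) - k + 1), k:nchar(x))  # sliding windows
# Fixing the factor levels keeps zero-count k-mers as columns
counts <- table(factor(km, levels = all_kmers(k)))
counts
```

With this input the counts agree with row [1,] of the output above.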


一整个雨季

2021-02-06 12:37

Another way to do this:

library(stringi)

DNAlst <- list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA")
len <- 4  # k-mer length
# extract every substring of length len and tabulate the results
stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len))
sapply(DNAlst, stri_sub_fun)
[[1]]

AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT 
   1    1    1    1    1    1    1    1    1 

[[2]]

AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 

[[3]]

ATGC ATTA TATG TTAT 
   1    1    1    1 

[[4]]

TGGA 
   1 

[[5]]

ATCA CATC CGCA CGCG GCAT GCGC TCAA 
   1    1    1    1    1    1    1 

[[6]]

ACAC ACCA CACA CACC 
   4    1    3    1
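The result above is a ragged list of tables, one per sequence. If a single matrix is wanted, the tables can be aligned on the union of their names. A minimal sketch in base R, using two small hand-made tables in place of the stringi output; `tabs`, `kmer_names` and `m` are names made up for this example.

```r
# Stand-ins for two per-sequence k-mer tables
tabs <- list(table(c("ACAC", "ACAC", "CACA")), table("TGGA"))

# Union of all k-mers seen in any sequence becomes the column set
kmer_names <- sort(unique(unlist(lapply(tabs, names))))

rows <- lapply(tabs, function(tb) {
  v <- setNames(integer(length(kmer_names)), kmer_names)  # start at zero
  v[names(tb)] <- as.vector(tb)                           # fill observed counts
  v
})
m <- do.call(rbind, rows)  # one row per sequence
m
```

Using `do.call(rbind, ...)` instead of `t(sapply(...))` keeps the result a matrix even when only one k-mer is observed.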


梦如初夏

2021-02-06 12:38
We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality for computing k-mer frequencies efficiently, too:
```
#installing kebabs (on current Bioconductor releases biocLite() has been replaced by BiocManager):
#install.packages("BiocManager")
#BiocManager::install(c("kebabs", "Biostrings"))
library(kebabs)

s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s1, 200), collapse=""))
s2

sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
```
As you can see, the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). Even for k=13 and a sequence with tens of thousands of bases, the computations take only a fraction of a second (19 ms on our 5-year-old Dell server). The above function also works for DNAStringSets, but in that case you should remove the drop() to get a matrix of k-mer frequencies. The matrix is sparse by default (class 'dgRMatrix'), but you can also force the result into standard dense matrix format (still omitting k-mers that do not occur in any of the sequences):
```
sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
```
The maximum k-mer length may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits on k are significantly lower, since the feature space is much larger for the same k.
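To put a number on "much larger": with the usual 4-letter DNA alphabet and an assumed 20-letter amino acid alphabet, the number of possible k-mers compares as follows. This is plain arithmetic and not a kebabs call.

```r
k <- c(5, 10)
feature_space <- data.frame(k = k, dna = 4^k, amino_acid = 20^k)
feature_space$ratio <- feature_space$amino_acid / feature_space$dna  # equals 5^k
feature_space
```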
```
#for the kebabs documentation please see:
browseVignettes("kebabs")
```
I hope that helps. If you have any further questions, please let me know.

Best regards, Ulrich