matching and counting strings (k-mer of DNA) in R

前端未结

关注

 5  1000

深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k

5条回答

名媛妹妹 (楼主)

2021-02-06 12:33

My answer wasn't as fast as @bartektartanus. However, it is also pretty fast and I wrote the code... :D

The plus side of my code when compared to the others is:

Don't need to install the unimplemented version of stri_count_fixed
Probably stringi package will get really slow for big k-mers since it has to generate all possible combinations for pattern and afterwards, check their existence in the data and count how many times it appears.
It also works for long single and and multiple sequences with the same output really fast.
You can put a value for k instead of creating a pattern string.
If you run oligonucleotideFrequency with a k bigger than 12 in a big sequence, the function freezes for excess of memory use and R is restarted, while with my function it runs pretty fast.

My code

sequence_kmers <- function(sequence, k){
    k_mers <- lapply(sequence,function(x){
        seq_loop_size <- length(DNAString(x))-k+1

        kmers <- sapply(1:seq_loop_size, function(z){
            y <- z + k -1
            kmer <- substr(x=x, start=z, stop=y)
            return(kmer)
        })
        return(kmers)
    })

    uniq <- unique(unlist(k_mers))
    ind <- t(sapply(k_mers, function(x){
        tabulate(match(x, uniq), length(uniq))
    }))
    colnames(ind) <- uniq

    return(ind)
}

I use the Biostringspackage only to count the bases... you can use other options like stringi to count... if you remove all code below k_mers lapply and return(k_mers) it returns just the list... of all k-mers with the respective repeated vectors

`sequence` here is a sequence of 1000bp

#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 
   4    4    3    4    4    8    6    4    5    5 
> sequence_kmers(c(sequence,sequence),4)[,1:10]
     GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,]    4    4    3    4    4    8    6    4    5    5
[2,]    4    4    3    4    4    8    6    4    5    5

Tests done with my function:

#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
  usuário   sistema decorrido 
     0.08      0.00      0.08 

#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
     user    system   elapsed
     3.61      0.00      3.61 

#same speed for 3-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
     user    system   elapsed
     3.63      0.00      3.62

Tests done with Biostrings:

#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
     user    system   elapsed 
   150.11      1.14    151.37 

#Biostrings package freezes for a single sequence of 13-mers
> system.time({oligonucleotideFrequency(sequence,13)})  
freezes, used all my 8gb RAM

0 讨论(0)

查看其它5个回答

matching and counting strings (k-mer of DNA) in R

The plus side of my code when compared to the others is:

My code

sequence here is a sequence of 1000bp

`sequence` here is a sequence of 1000bp