Matching and counting strings (k-mers of DNA) in R

深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequences) made up of A, T, C and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k

5 answers
  •  名媛妹妹
    2021-02-06 12:33

    My answer isn't as fast as @bartektartanus's. However, it is still pretty fast, and I had already written the code... :D

    The advantages of my code compared to the others are:

    1. You don't need to install the unreleased (development) version of stri_count_fixed.
    2. The stringi approach will probably get really slow for big k-mers, since it has to generate all possible combinations for the pattern and then check whether each one occurs in the data and count how many times it appears.
    3. It also works really fast for a single long sequence and for multiple sequences, with the same output format.
    4. You can pass a value for k instead of building a pattern string.
    5. If you run oligonucleotideFrequency with k bigger than 12 on a big sequence, the function freezes from excessive memory use and R restarts, while my function runs pretty fast.
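
    To make point 2 concrete: there are 4^k possible k-mers, so any approach that enumerates every pattern (or builds one column per possible k-mer) grows very quickly with k. A minimal base R sketch of that enumeration for small k; all_kmers is a hypothetical helper name, not a function from stringi or Biostrings:

```r
# Hypothetical helper: enumerate all 4^k possible DNA k-mers (only sensible for small k)
all_kmers <- function(k, alphabet = c("A", "C", "G", "T")) {
  # expand.grid builds every combination of k letters, one letter per column
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  # Paste the k columns of each row together into one string
  do.call(paste0, grid)
}

kmers2 <- all_kmers(2)
length(kmers2)   # 16, i.e. 4^2
4^12             # 16777216 possible 12-mers
```

    Counting only the k-mers that actually occur, as the function in this answer does, avoids this blow-up: a sequence of length n contains at most n - k + 1 distinct k-mers.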

    My code

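    library(Biostrings)  # only needed for DNAString(), used to get the sequence length

    sequence_kmers <- function(sequence, k){
        # Collect every overlapping k-mer of each sequence
        k_mers <- lapply(sequence, function(x){
            # Number of k-mer start positions (0 if the sequence is shorter than k)
            seq_loop_size <- max(length(DNAString(x)) - k + 1, 0)

            kmers <- vapply(seq_len(seq_loop_size), function(z){
                y <- z + k - 1
                substr(x = x, start = z, stop = y)
            }, character(1))
            return(kmers)
        })

        # Count the observed k-mers: one row per sequence, one column per distinct k-mer
        uniq <- unique(unlist(k_mers))
        ind <- do.call(rbind, lapply(k_mers, function(x){
            tabulate(match(x, uniq), length(uniq))
        }))
        colnames(ind) <- uniq

        return(ind)
    }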
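
    The counting step relies on tabulate(match(x, uniq), length(uniq)): match() maps each k-mer to its position in uniq, and tabulate() counts how often each position occurs. A small illustration on made-up k-mers (the vectors are invented for this example):

```r
# Toy 2-mers taken from the made-up sequence "ACACG"
x <- c("AC", "CA", "AC", "CG")
uniq <- unique(x)                       # "AC" "CA" "CG"

idx <- match(x, uniq)                   # 1 2 1 3: position of each k-mer in uniq
counts <- tabulate(idx, nbins = length(uniq))
names(counts) <- uniq
counts                                  # AC = 2, CA = 1, CG = 1
```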
    

    I use the Biostrings package only to count the bases; you can use other options, such as stringi, for that. If you remove all the code after the k_mers lapply and return(k_mers) instead, the function returns just the list of all k-mers, one vector per sequence, with repeated k-mers kept.
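
    As a sketch of that list-only variant, here is a version that uses base R's nchar() instead of Biostrings to get the sequence length. The name kmer_list is made up for this example, and the vectorised substring() call is a swapped-in simplification rather than the original loop:

```r
# Hypothetical variant: return only the k-mers of each sequence (no counting)
kmer_list <- function(sequence, k) {
  lapply(sequence, function(x) {
    n_starts <- nchar(x) - k + 1            # number of k-mer start positions
    if (n_starts < 1) return(character(0))  # sequence shorter than k
    starts <- seq_len(n_starts)
    # substring() is vectorised over first/last, so no explicit loop is needed
    substring(x, first = starts, last = starts + k - 1)
  })
}

res <- kmer_list(c("ACGTA", "AC"), 3)
res[[1]]   # "ACG" "CGT" "GTA"
res[[2]]   # character(0), because "AC" is shorter than k = 3
```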

    Here, sequence is a single sequence of 1000 bp.
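
    The actual sequence is not shown here. To try the examples, a random 1000 bp stand-in can be generated as follows (the seed and the uniform base frequencies are arbitrary assumptions, so the counts will differ from the output below):

```r
set.seed(42)  # arbitrary seed, only for reproducibility
bases <- c("A", "C", "G", "T")
# Draw 1000 bases uniformly at random and collapse them into one string
sequence <- paste(sample(bases, 1000, replace = TRUE), collapse = "")
nchar(sequence)   # 1000
```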

    #same output for 1 or multiple sequences
    > sequence_kmers(sequence,4)[,1:10]
    GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 
       4    4    3    4    4    8    6    4    5    5 
    > sequence_kmers(c(sequence,sequence),4)[,1:10]
         GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
    [1,]    4    4    3    4    4    8    6    4    5    5
    [2,]    4    4    3    4    4    8    6    4    5    5
    

    Tests done with my function:

    #super fast for 1 sequence
    > system.time({sequence_kmers(sequence,13)})
         user    system   elapsed 
         0.08      0.00      0.08 
    
    #works fast for 1 sequence or 50 sequences of 1000bps
    > system.time({sequence_kmers(rep(sequence,50),4)})
         user    system   elapsed
         3.61      0.00      3.61 
    
    #same speed for 4-mers or 13-mers
    > system.time({sequence_kmers(rep(sequence,50),13)})
         user    system   elapsed
         3.63      0.00      3.62 
    

    Tests done with Biostrings:

    #slow for 1 sequence with 12-mers
    > system.time({oligonucleotideFrequency(DNAString(sequence),12)})
         user    system   elapsed 
       150.11      1.14    151.37 
    
    #Biostrings freezes for a single sequence with 13-mers
    > system.time({oligonucleotideFrequency(DNAString(sequence),13)})
    # freezes; used all of my 8 GB of RAM
    
