matching and counting strings (k-mer of DNA) in R

前端 未结 5 995
深忆病人
深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k

相关标签:
5条回答
  • 2021-02-06 12:26

    If you are looking for speed the obvious solution is stringi package. There is stri_count_fixed function for counting patterns. And now, check the code and benchmark!

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
    dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
    result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
    colnames(result) <- dna
    result
         AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    [1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
    [2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
    [3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
    [4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
    [5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
    
    
    
    fstri <- function(x){
        t(sapply(x, stri_count_fixed,dna,T))
    }
    fbio <- function(x){
        t(sapply(x, function(x){x1 <-  DNAString(x); oligonucleotideFrequency(x1,2)}))
    }
    
    all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
    [1] TRUE
    
    longDNA <- sample(DNAlst,100,T)
    microbenchmark(fstri(longDNA),fbio(longDNA))
    Unit: microseconds
               expr        min         lq        mean     median         uq        max neval
     fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
      fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100
    127245.489/766.862
    ## [1] 165.9301
    

    Ca 165x times faster :)

    0 讨论(0)
  • 2021-02-06 12:33

    My answer wasn't as fast as @bartektartanus. However, it is also pretty fast and I wrote the code... :D

    The plus side of my code when compared to the others is:

    1. Don't need to install the unimplemented version of stri_count_fixed
    2. Probably stringi package will get really slow for big k-mers since it has to generate all possible combinations for pattern and afterwards, check their existence in the data and count how many times it appears.
    3. It also works for long single and and multiple sequences with the same output really fast.
    4. You can put a value for k instead of creating a pattern string.
    5. If you run oligonucleotideFrequency with a k bigger than 12 in a big sequence, the function freezes for excess of memory use and R is restarted, while with my function it runs pretty fast.

    My code

    sequence_kmers <- function(sequence, k){
        k_mers <- lapply(sequence,function(x){
            seq_loop_size <- length(DNAString(x))-k+1
    
            kmers <- sapply(1:seq_loop_size, function(z){
                y <- z + k -1
                kmer <- substr(x=x, start=z, stop=y)
                return(kmer)
            })
            return(kmers)
        })
    
        uniq <- unique(unlist(k_mers))
        ind <- t(sapply(k_mers, function(x){
            tabulate(match(x, uniq), length(uniq))
        }))
        colnames(ind) <- uniq
    
        return(ind)
    }
    

    I use the Biostringspackage only to count the bases... you can use other options like stringi to count... if you remove all code below k_mers lapply and return(k_mers) it returns just the list... of all k-mers with the respective repeated vectors

    sequence here is a sequence of 1000bp

    #same output for 1 or multiple sequences
    > sequence_kmers(sequence,4)[,1:10]
    GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 
       4    4    3    4    4    8    6    4    5    5 
    > sequence_kmers(c(sequence,sequence),4)[,1:10]
         GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
    [1,]    4    4    3    4    4    8    6    4    5    5
    [2,]    4    4    3    4    4    8    6    4    5    5
    

    Tests done with my function:

    #super fast for 1 sequence
    > system.time({sequence_kmers(sequence,13)})
      usuário   sistema decorrido 
         0.08      0.00      0.08 
    
    #works fast for 1 sequence or 50 sequences of 1000bps
    > system.time({sequence_kmers(rep(sequence,50),4)})
         user    system   elapsed
         3.61      0.00      3.61 
    
    #same speed for 3-mers or 13-mers
    > system.time({sequence_kmers(rep(sequence,50),13)})
         user    system   elapsed
         3.63      0.00      3.62 
    

    Tests done with Biostrings:

    #Slow 1 sequence 12-mers
    > system.time({oligonucleotideFrequency(DNAString(sequence),12)})
         user    system   elapsed 
       150.11      1.14    151.37 
    
    #Biostrings package freezes for a single sequence of 13-mers
    > system.time({oligonucleotideFrequency(sequence,13)})  
    freezes, used all my 8gb RAM
    
    0 讨论(0)
  • 2021-02-06 12:34

    May be this helps

     source("http://bioconductor.org/biocLite.R")
     biocLite("Biostrings")
     library(Biostrings)
     t(sapply(DNAlst, function(x){x1 <-  DNAString(x)
                       oligonucleotideFrequency(x1,2)}))
      #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
      #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
      #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
      #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
      #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
      #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
    

    Or as suggested by @Arun, convert the list to vector first

       oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
       #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
       #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
       #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
       #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
       #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
       #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
    
    0 讨论(0)
  • 2021-02-06 12:37

    Another way to do this:

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA")
    len <- 4
    stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len))
    sapply(DNAlst, stri_sub_fun)
    [[1]]
    
    AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT 
       1    1    1    1    1    1    1    1    1 
    
    [[2]]
    
    AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA 
       1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
    
    [[3]]
    
    ATGC ATTA TATG TTAT 
       1    1    1    1 
    
    [[4]]
    
    TGGA 
       1 
    
    [[5]]
    
    ATCA CATC CGCA CGCG GCAT GCGC TCAA 
       1    1    1    1    1    1    1 
    
    [[6]]
    
    ACAC ACCA CACA CACC 
       4    1    3    1 
    
    0 讨论(0)
  • 2021-02-06 12:38

    We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality for computing k-mer frequencies efficiently, too:

    #installing kebabs:
    #source("http://bioconductor.org/biocLite.R")
    #biocLite(c("kebabs", "Biostrings"))
    library(kebabs)
    
    s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
    s1
    s2 <- DNAString(paste0(rep(s1, 200), collate=""))
    s2
    
    sk13 <- spectrumKernel(k=13, normalized=FALSE)
    system.time(kmerFreq <- drop(getExRep(s1, sk13)))
    kmerFreq
    system.time(kmerFreq <- drop(getExRep(s2, sk13)))
    kmerFreq
    

    So you see that the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). You see that even for k=13 and a sequence with tens of thousands of bases, the computations only take fractions of a second (19 msecs on our 5-year-old Dell server). The above function also works for DNAStringSets, but, in this case, you should remove the drop() to get a matrix of k-mer frequencies. The matrix is by default sparse (class 'dgRMatrix'), but you can also enforce the result to be in standard dense matrix format (however, still omitting k-mers that do not occur at all in any of the sequences):

    sv <- c(DNAStringSet(s1), DNAStringSet(s2))
    system.time(kmerFreq <- getExRep(sv, sk13))
    kmerFreq
    system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
    kmerFreq
    

    How long the k-mers may be, may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are significantly lower, since the feature space is obviously much larger for the same k.

    #for the kebabs documentation please see:
    browseVignettes("kebabs")
    

    I hope that helps. If you have any further questions, please let me know.

    Best regards, Ulrich

    0 讨论(0)
提交回复
热议问题