Faster way to split a string and count characters using R?

太阳男子 2021-02-01 08:51

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the let

6 Answers
  • 2021-02-01 09:04

    Better to not split at all, just count the matches:

    gcCount2 <-  function(line, st, sp){
      sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
    }
    

    That's an order of magnitude faster.

    A small C function that just iterates over the characters would be yet another order of magnitude faster.
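
    The regex-counting idea extends to a whole vector of sequences, because gregexpr returns one vector of match positions per input string. A minimal sketch, assuming the reads are already held in a character vector; gc_fraction is a hypothetical name, not part of the original answer:

    ```r
    # gc_fraction: hypothetical helper, returns the G/C fraction of each string.
    gc_fraction <- function(seqs) {
      matches <- gregexpr("[GCgc]", seqs)
      # gregexpr yields -1 when nothing matches, so count only positive positions
      n_gc <- vapply(matches, function(m) sum(m > 0), integer(1))
      n_gc / nchar(seqs)
    }

    gc_fraction(c("ATGCCC", "aaaa", "ggcc"))
    ```

    This avoids an explicit loop over lines but still runs the regex engine once per string, so it is unlikely to beat a compiled counter.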

  • 2021-02-01 09:09

    A one-liner, where a holds the sequence string:

    table(strsplit(toupper(a), '')[[1]])
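
    The table gives per-letter counts; turning it into a GC fraction takes one more step. A small sketch, assuming a holds a single sequence (the example string is borrowed from another answer on this page, and gc_frac is a name made up for illustration):

    ```r
    a <- "GCCCAAAATTTTCCGGggcc"   # example sequence, assumed for illustration
    counts <- table(strsplit(toupper(a), "")[[1]])
    # match by name so a sequence lacking G or C does not raise a subscript error
    gc_frac <- sum(counts[names(counts) %in% c("G", "C")]) / sum(counts)
    gc_frac
    ```

    Here 12 of the 20 letters are G or C, so gc_frac is 0.6.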
    
  • 2021-02-01 09:16

    There's no need to use a loop here.

    Try this:

    gcCount <-  function(line, st, sp){
      chars = strsplit(as.character(line),"")[[1]][st:sp]
      length(which(tolower(chars) == "g" | tolower(chars) == "c"))
    }
    
  • 2021-02-01 09:17

    Thanks to all for this post,

    To optimize a script that calculates the GC content of 100M sequences of 200 bp each, I ended up testing the different methods proposed here. Of these, Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using str_count from stringr reduced this to 1.5 hours.

    In the end I coded it in C++ and called it using Rcpp, which cuts the computation time down to 10 minutes!

    Here is the C++ code:

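    #include <Rcpp.h>
    using namespace Rcpp;

    // Percentage of 'G' and 'C' characters in s (uppercase only).
    // [[Rcpp::export]]
    double pGC_cpp(std::string s) {
      int count = 0;

      // size_t index avoids a signed/unsigned comparison with s.size()
      for (std::size_t i = 0; i < s.size(); i++)
        if (s[i] == 'G' || s[i] == 'C') count++;

      // double arithmetic; an empty string gives NaN (0/0)
      return 100.0 * count / s.size();
    }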
    

    I call it from R with:

    library(Rcpp)
    sourceCpp("pGC_cpp.cpp")
    pGC_cpp("ATGCCC")
    
  • 2021-02-01 09:23

    I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).

    GC content would be:

    library(seqinr)
    mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
    GC(mysequence)  # 0.4761905
    

    That works on a string; you can also read in a FASTA file using read.fasta().
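
    If installing seqinr is not an option, the FASTA step can be approximated in base R. This is only a sketch, not seqinr's implementation: parse_fasta is a hypothetical helper, it assumes a plain FASTA layout (a ">" header line followed by one or more sequence lines), and the demo file is generated on the fly:

    ```r
    # parse_fasta: hypothetical helper, turns FASTA lines into a named character vector.
    parse_fasta <- function(lines) {
      lines <- trimws(lines)
      lines <- lines[nzchar(lines)]
      is_header <- startsWith(lines, ">")
      record <- cumsum(is_header)        # which record each line belongs to
      body <- !is_header & record > 0    # sequence lines after the first header
      seqs <- tapply(lines[body], record[body], paste, collapse = "")
      headers <- sub("^>", "", lines[is_header])
      setNames(as.vector(seqs), headers[as.integer(names(seqs))])
    }

    # Demo on a temporary file so the sketch is self-contained.
    fasta_file <- tempfile(fileext = ".fasta")
    writeLines(c(">seq1", "ATGC", "CC", ">seq2", "aaaa"), fasta_file)
    seqs <- parse_fasta(readLines(fasta_file))

    # GC fraction per record: delete everything that is not G/C, compare lengths
    gc_by_seq <- nchar(gsub("[^GCgc]", "", seqs)) / nchar(seqs)
    gc_by_seq
    ```

    For real data, read.fasta() is the safer choice, since it handles cases this sketch ignores.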

  • 2021-02-01 09:27

    Try this function from the stringi package:

    > stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
    [1] 3 5
    

    Or use the regex version to count both uppercase and lowercase letters:

    > stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
    [1] 12
    

    Or convert to lowercase first with stri_trans_tolower and then use stri_count:

    > stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
    [1] "gcccaaaattttccggggcc"
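
    Putting the last two steps together, a vectorised GC fraction with stringi might look like the sketch below. gc_fraction_stringi is a hypothetical name, and the code assumes the stringi package is installed:

    ```r
    # gc_fraction_stringi: hypothetical helper, assumes stringi is installed.
    gc_fraction_stringi <- function(seqs) {
      lower <- stringi::stri_trans_tolower(seqs)
      # fixed-pattern counts for "g" and "c", added element-wise
      n_gc <- stringi::stri_count_fixed(lower, "g") +
        stringi::stri_count_fixed(lower, "c")
      n_gc / stringi::stri_length(seqs)
    }
    ```

    For example, gc_fraction_stringi("GCCCAAAATTTTCCGGggcc") should give 0.6, since 12 of the 20 characters are G or C.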
    

    Timing:

    > microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
    Unit: microseconds
                                 expr     min     lq  median      uq     max neval
                    gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
                   gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
     stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100
    

    Another example, with a longer string. stri_dup repeats a string n times:

    > stri_dup("abc",3)
    [1] "abcabcabc"
    

    As you can see, stri_count_regex is the fastest on the longer sequence :)

    > y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
    > microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
    Unit: microseconds
                                 expr       min         lq     median        uq       max neval
              gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
             gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
     stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100
    