I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the let
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
  # count regex matches for G/C (either case) in the substring st..sp
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
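The character loop suggested above might look like the following. This is only a minimal standalone sketch in C++ (not C wired into R, which would additionally need glue such as Rcpp); the name gc_count_range and the clamping of out-of-range bounds are assumptions made for illustration, and no timing is claimed for it.

```cpp
#include <cstddef>
#include <string>

// Count G/C (either case) in the 1-based, inclusive range [st, sp] of `line`,
// mirroring what substr(line, st, sp) selects in R.
// gc_count_range is a hypothetical name, not part of any library.
std::size_t gc_count_range(const std::string& line, std::size_t st, std::size_t sp) {
    if (st < 1) st = 1;                      // clamp the start to the first character
    if (sp > line.size()) sp = line.size();  // clamp the end to the last character
    std::size_t count = 0;
    for (std::size_t i = st; i <= sp; ++i) {
        const char ch = line[i - 1];         // 1-based position to 0-based index
        if (ch == 'G' || ch == 'C' || ch == 'g' || ch == 'c') ++count;
    }
    return count;
}
```

Calling gc_count_range(line, 1, line.size()) counts over the whole string; an empty or inverted range simply returns 0.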
A one liner:
table(strsplit(toupper(a), '')[[1]])
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
  # split into single characters and keep positions st..sp
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Thanks to all for this post.
To optimize a script in which I want to calculate the GC content of 100M sequences of 200 bp, I ended up testing the different methods proposed here. Among those, Ken Williams' method performed best (2.5 hours), ahead of seqinr (3.6 hours). Switching to stringr's str_count brought that down to 1.5 hours.
In the end I coded it in C++ and called it using Rcpp, which cuts the computation time down to 10 minutes!
Here is the C++ code:
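#include <Rcpp.h>
using namespace Rcpp;
// Percentage of G and C in a DNA string (upper-case letters only).
// [[Rcpp::export]]
double pGC_cpp(std::string s) {
  int count = 0;
  // size_t index avoids a signed/unsigned comparison with s.size()
  for (std::size_t i = 0; i < s.size(); i++) {
    if (s[i] == 'G' || s[i] == 'C') count++;
  }
  // an empty string yields NaN (0/0)
  return 100.0 * count / s.size();
}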
Which I call from R by typing:
library(Rcpp)
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")
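The function above handles one sequence per call. For a large batch, one option is to pass all sequences in a single call and loop on the C++ side. The sketch below is plain C++ so that it compiles without Rcpp; the names pgc_one and pgc_batch are hypothetical, and two behaviours are assumptions that differ from the code above: lower-case g/c are also counted, and an empty sequence returns 0. With Rcpp, a function with this signature could presumably be exported by adding the Rcpp header and export attribute, since Rcpp converts std::vector<std::string> and std::vector<double> automatically.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Percent GC for one sequence; counts upper- and lower-case G/C.
// Returns 0 for an empty string to avoid dividing by zero (an assumption).
double pgc_one(const std::string& s) {
    if (s.empty()) return 0.0;
    std::size_t count = 0;
    for (char ch : s) {
        if (ch == 'G' || ch == 'C' || ch == 'g' || ch == 'c') ++count;
    }
    return 100.0 * static_cast<double>(count) / static_cast<double>(s.size());
}

// Hypothetical batch helper: one call computes percent GC for every sequence.
std::vector<double> pgc_batch(const std::vector<std::string>& seqs) {
    std::vector<double> out;
    out.reserve(seqs.size());  // one result per input sequence
    for (const std::string& s : seqs) out.push_back(pgc_one(s));
    return out;
}
```

Whether this is actually faster than calling the single-sequence function in a loop would need to be measured on real data.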
I don't know that it's any faster, but you might want to look at the R package seqinR (http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng). It is an excellent, general bioinformatics package with many methods for sequence analysis. It's on CRAN (which seems to be down as I write this).
GC content would be:
library(seqinr)
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")  # string to vector of characters
GC(mysequence) # 0.4761905
That's from a string; you can also read in a FASTA file using read.fasta().
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
Or you can use the regex version to match both cases (for example g and G):
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
Or you can convert to lower case first with stri_trans_tolower and then use stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Timing:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
                             expr     min     lq  median      uq     max neval
                gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
               gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
 stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100
Another example, for a longer string. stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
For a longer sequence, stri_count_regex is faster :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
                             expr       min         lq     median        uq       max neval
          gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
         gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
 stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100