I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k
If you are looking for speed the obvious solution is stringi package.
There is stri_count_fixed
function for counting patterns.
And now, check the code and benchmark!
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3
[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0
[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1
[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0
[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0
fstri <- function(x){
t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)}))
}
all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE
longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
expr min lq mean median uq max neval
fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100
fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100
127245.489/766.862
## [1] 165.9301
Ca 165x times faster :)