matching and counting strings (k-mer of DNA) in R

前端未结

关注

 5  985

深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k

5条回答

盖世英雄少女心 (楼主)

2021-02-06 12:26

If you are looking for speed the obvious solution is stringi package. There is stri_count_fixed function for counting patterns. And now, check the code and benchmark!

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0



fstri <- function(x){
    t(sapply(x, stri_count_fixed,dna,T))
}
fbio <- function(x){
    t(sapply(x, function(x){x1 <-  DNAString(x); oligonucleotideFrequency(x1,2)}))
}

all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE

longDNA <- sample(DNAlst,100,T)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
           expr        min         lq        mean     median         uq        max neval
 fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
  fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100
127245.489/766.862
## [1] 165.9301

Ca 165x times faster :)

0 讨论(0)

查看其它5个回答