Faster way to split a string and count characters using R?

后端未结

关注

 6  739

太阳男子 2021-02-01 08:51

I\'m looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the let

6条回答

醉梦人生 (楼主)

2021-02-01 09:27

Try this function from stringi package

> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5

or you can use regex version to count g and G

> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12

or you can use tolower function first and then stri_count

> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"

time performance

    > microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
                             expr     min     lq  median      uq     max neval
                gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
               gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
 stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100

another example for longer string. stri_dup replicates string n-times

> stri_dup("abc",3)
[1] "abcabcabc"

As you can see, for longer sequence stri_count is faster :)

> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
    > microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
    Unit: microseconds
                                 expr       min         lq     median        uq       max neval
              gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
             gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
     stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100

0 讨论(0)

查看其它6个回答