Matching and counting strings (k-mers of DNA) in R

深忆病人 2021-02-06 11:55

I have a list of strings (DNA sequences) made up of the letters A, T, C and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k

5 Answers
  •  梦谈多话
    2021-02-06 12:34

    Maybe this helps:

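     # biocLite() is defunct; Biostrings is now installed through BiocManager
     if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
     BiocManager::install("Biostrings")
     library(Biostrings)
     # One row per sequence, one column per dinucleotide (4^2 = 16 columns)
     t(sapply(DNAlst, function(x) oligonucleotideFrequency(DNAString(x), 2)))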
      #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
      #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
      #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
      #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
      #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
      #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
    

    Or, as suggested by @Arun, convert the list to a vector first:

       oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
       #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
       #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
       #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
       #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
       #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
       #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
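
    If installing Biostrings is not an option, the same kind of table can be sketched in base R. This is only an illustration, not the method from the answer above: `count_kmers` is a hypothetical helper, and the sample `DNAlst` is made up because the original input is not shown in the thread. It assumes overlapping windows (step 1), which is my understanding of the default behaviour of `oligonucleotideFrequency`, and it builds all 4^k column names up front so that k-mers that never occur still get a zero column.

```r
# Hypothetical input: the original DNAlst is not shown in the thread
DNAlst <- list("ACGTAC", "AAAA", "GATTACA")

# Count overlapping k-mers in each sequence using base R only
count_kmers <- function(seqs, k = 2L) {
  seqs <- unlist(seqs)
  # All 4^k possible k-mers in alphabetical order (AA, AC, ..., TT for k = 2)
  grid <- expand.grid(rep(list(c("A", "C", "G", "T")), k),
                      stringsAsFactors = FALSE)
  kmers <- sort(do.call(paste0, grid))
  counts <- t(vapply(seqs, function(s) {
    n <- nchar(s) - k + 1L
    # Sliding window of width k, step 1; empty if the sequence is shorter than k
    windows <- if (n > 0L) {
      substring(s, seq_len(n), seq_len(n) + k - 1L)
    } else {
      character(0)
    }
    # factor() with fixed levels keeps zero counts for absent k-mers
    as.integer(table(factor(windows, levels = kmers)))
  }, integer(length(kmers))))
  dimnames(counts) <- list(NULL, kmers)
  counts
}

kmer_table <- count_kmers(DNAlst, k = 2L)
print(kmer_table)
```

    Calling `count_kmers(DNAlst, k = 3L)` gives a 64-column table in the same way. For long sequences or large k, the Biostrings version is likely to be much faster, so treat this as a fallback rather than a replacement.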
    
