Find the Hamming distance between string sequences

Question

I have a dataset of 3156 DNA sequences, each 98290 characters (SNPs) long, drawn from the usual five symbols: A, C, G, T, and N (gap).

What is the optimal way to find the pairwise Hamming distance between these sequences?

Note that, for each sequence, what I actually want is the reciprocal of the number of sequences (including itself) whose per-site Hamming distance to it (the number of mismatched sites divided by the sequence length) is below some threshold (0.1 in this example).
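
To make the target quantity concrete, here is a minimal sketch on toy data. The names `seqs`, `threshold`, `per_site_hamming` and `weights` are hypothetical, and the threshold of 0.15 is chosen only so that the toy sequences fall clearly on one side of it; the real data would use 0.1.

```r
# Toy data (hypothetical): 4 sequences of 10 sites each, one character per site.
seqs <- list(
  strsplit("ACGTACGTAC", "")[[1]],
  strsplit("ACGTACGTAC", "")[[1]],  # identical to sequence 1
  strsplit("ACGTACGTAN", "")[[1]],  # differs from sequence 1 at one site
  strsplit("TGCATGCATG", "")[[1]]   # differs from sequence 1 at every site
)
threshold <- 0.15  # hypothetical value for the toy data

# Per-site Hamming distance: fraction of positions at which two sequences differ.
per_site_hamming <- function(a, b) mean(a != b)

# For each sequence, the reciprocal of the number of sequences (itself included)
# whose per-site distance to it is below the threshold.
weights <- sapply(seqs, function(s) {
  1 / sum(sapply(seqs, function(x) per_site_hamming(x, s) < threshold))
})
print(weights)
```

The first three sequences are within the threshold of one another, so each gets 1/3; the fourth is close only to itself and gets 1.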

So far, I have attempted the following:

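library(doParallel)
registerDoParallel(cores = 8)
# For each sequence i, take the reciprocal of the number of sequences
# (including i itself) whose per-site Hamming distance to i is below 0.1.
result <- foreach(i = seq_along(snpdat)) %dopar% {
  1 / sum(sapply(snpdat, function(x) sum(x != snpdat[[i]]) / 98290 < 0.1))
}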

snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.

This takes around 36 minutes to run on a Core i7-4790 with 16 GB of RAM. I also tried the stringdist package, which takes even longer to produce the same result.

Any help is highly appreciated!
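
One direction that might be worth trying, sketched here on toy data, is to replace the nested loop with matrix products: for each symbol, build a 0/1 indicator matrix and use base R's `tcrossprod` to count the sites where two sequences share that symbol. This is only a sketch under assumptions: the small `snpdat` below is a hypothetical stand-in for the real list, all sequences are assumed to have equal length, and the approach has not been timed on data of the size described above.

```r
# Hypothetical stand-in for `snpdat`: a list of character vectors, one
# character per site, all of the same length.
symbols <- c("A", "C", "G", "T", "N")
base <- rep(symbols, 10)                 # 50 sites
near <- base; near[1:2] <- c("C", "A")   # differs from `base` at 2 sites
snpdat <- list(base, near, rev(base), rep("N", 50))

n_sites <- length(snpdat[[1]])
threshold <- 0.1

# Stack the sequences into an n_seq x n_sites character matrix.
m <- do.call(rbind, snpdat)

# For each symbol, tcrossprod of the 0/1 indicator matrix counts the sites at
# which both sequences carry that symbol; summing over symbols gives the
# number of matching sites for every pair of sequences.
matches <- Reduce(`+`, lapply(symbols, function(s) tcrossprod((m == s) + 0)))

# Per-site Hamming distance for every pair, then the reciprocal of the
# number of sequences (self included) below the threshold.
dist_per_site <- (n_sites - matches) / n_sites
result <- 1 / rowSums(dist_per_site < threshold)
print(result)
```

As in the loop version, N matched against N counts as agreement. On the full dataset each indicator matrix would hold 3156 x 98290 doubles (roughly 2.5 GB), so with 16 GB of RAM it may be necessary to process the sites in column blocks and accumulate `matches` across blocks.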

Source: https://stackoverflow.com/questions/59942412/find-the-hamming-distance-between-string-sequences

Tags

arrays

string

vectorization

hamming-distance
