Question
I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs) drawn from the usual 5 symbols: A, C, G, T and N (gap).
What is the most efficient way to compute the pairwise Hamming distances between these sequences?
Note that for each sequence, what I actually want is the reciprocal of the number of sequences (including itself) whose per-site Hamming distance to it is below some threshold (0.1 in this example).
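To make the target quantity concrete, here is a minimal sketch on toy data. The sequences, the 0.2 threshold, and the names `toy`, `per_site_hamming` and `weights` are made up for illustration and are not part of my real data; N is compared like any other symbol.

```r
# Toy data (made up): 4 sequences of 6 sites each, stored as character vectors
toy <- list(
  c("A", "C", "G", "T", "A", "C"),
  c("A", "C", "G", "T", "A", "G"),
  c("T", "G", "C", "A", "T", "G"),
  c("A", "C", "G", "T", "N", "C")
)
threshold <- 0.2  # illustrative only; the real data uses 0.1

# Per-site Hamming distance: fraction of positions at which two sequences differ
per_site_hamming <- function(x, y) sum(x != y) / length(x)

# For each sequence: 1 / (number of sequences, itself included, closer than threshold)
weights <- sapply(toy, function(s) {
  1 / sum(sapply(toy, function(x) per_site_hamming(x, s) < threshold))
})
print(weights)
```

Here the first sequence is within the threshold of itself and of the second and fourth sequences, so its value is 1/3, while the third sequence is only close to itself and gets 1.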
So far, I have attempted the following:
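library(doParallel)
registerDoParallel(cores = 8)
# For each sequence i, count the sequences (itself included) whose per-site
# Hamming distance to sequence i is below 0.1, then take the reciprocal.
result <- foreach(i = 1:3156) %dopar% {
  1 / sum(sapply(snpdat, function(x) sum(x != snpdat[[i]]) / 98290 < 0.1))
}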
snpdat is a list variable where snpdat[[i]] contains the i-th DNA sequence.
This takes around 36 minutes to run on a Core i7-4790 with 16 GB of RAM.
I also tried the stringdist package, which takes even longer to produce the same result.
Any help is highly appreciated!
Source: https://stackoverflow.com/questions/59942412/find-the-hamming-distance-between-string-sequences