Find the Hamming distance between string sequences
Question: I have a dataset of 3156 DNA sequences, each 98290 characters (SNPs) long, drawn from the usual five symbols: A, C, G, T and N (gap). What is the most efficient way to compute the pairwise Hamming distances between these sequences? Note that, for each sequence, what I actually want is the reciprocal of the number of sequences (including itself) whose per-site Hamming distance to it (i.e. the Hamming distance divided by the sequence length) is below some threshold (0.1 in this example). So far, I have attempted the following:
library(doParallel)