Question
I have a large dataset with more than 2 million sequences, about 180,000 of which are unique. I am using the seqdist command to compute distances, and I ultimately want to identify clusters of sequences. Below is the error message I get:
Code and error message
Is there any way of setting a different maximum number of sequences, or some other workaround? Thank you very much in advance!
Answer 1:
The size limit for the distance matrix follows from the maximum allowed index value. This value is machine dependent.
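As a rough illustration of where such a limit comes from: in R the largest integer value is `.Machine$integer.max`, and an n x n matrix has n^2 cells. The sketch below assumes the bound is simply the largest n whose square still fits under that value; this is an assumption about the form of the check, not something taken from the TraMineR source.

```r
# Largest integer value on this machine
max_index <- .Machine$integer.max

# Largest n such that n * n does not exceed max_index
# (assumed form of the limit, for illustration only)
max_n <- floor(sqrt(max_index))
max_n

# Memory needed for a full n x n matrix of doubles at that size, in GiB
max_n^2 * 8 / 2^30
```

Even if the index limit were lifted, a full matrix for 180,000 unique sequences would need far more memory than most machines have, which is why the subset approach described next is useful.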
For a very large number n of sequences, one solution is to draw a random representative subset of the sequences, compute the dissimilarities for this subset, and cluster the subset.
If a cluster membership is needed for each individual sequence, you can identify the medoid of each cluster obtained from the subset and then assign each individual sequence to its closest medoid. For k clusters, this requires computing only n x k distances instead of the full pairwise matrix.
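To give a sense of the savings, the following sketch counts the distances each strategy has to compute. The value of n comes from the question; the subset size m and the number of clusters k are hypothetical values chosen only for illustration.

```r
n <- 180000  # unique sequences reported in the question
m <- 400     # hypothetical size of the random subset
k <- 4       # hypothetical number of clusters

# Full pairwise approach: one distance per unordered pair
pairwise_full <- n * (n - 1) / 2

# Subset approach: pairwise distances within the subset,
# plus one distance from every sequence to each of the k medoids
pairwise_subset <- m * (m - 1) / 2
to_medoids <- n * k

c(full = pairwise_full, subset_plus_medoids = pairwise_subset + to_medoids)
```

With these values the subset approach needs several orders of magnitude fewer distance computations than the full pairwise matrix.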
I illustrate below using the biofam data that ships with TraMineR.
Note that up to version 2.2-0.1, TraMineR checked the size of the full pairwise distance matrix even when refseq was used. This has been fixed in the development version available at https://r-forge.r-project.org/R/?group_id=743.
library(TraMineR)
data(biofam)
b.seq <- seqdef(biofam[, 10:25])
## compute pairwise distances on a random subset
spl <- sample(nrow(b.seq),400)
bs.seq <- b.seq[spl,]
d.lcs <- seqdist(bs.seq, method="LCS", full.matrix=FALSE)
## cluster the random subset
bs.hclust <- hclust(as.dist(d.lcs), method="ward.D")
#plot(bs.hclust, labels=FALSE)
cl <- cutree(bs.hclust,k=4)
## plot clusters for random subset
seqdplot(bs.seq, group=cl, border=NA)
## Medoids of the clusters
c.cl <- disscenter(d.lcs, group=cl, medoids="first")
seqiplot(bs.seq[c.cl,]) # plot of the medoids
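## distances from every sequence to each medoid
## (spl[c.cl[i]] maps the medoid's position in the subset back to its row in b.seq)
dc <- matrix(0, nrow=nrow(b.seq), ncol=length(c.cl))
for (i in seq_along(c.cl)) {
  dc[,i] <- seqdist(b.seq, method="LCS", refseq=spl[c.cl[i]])
}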
## cluster membership for the full sequence dataset
## is for each row the column with the smallest distance
cl.all <- max.col(-dc)
## now we can plot clusters for the whole dataset
seqdplot(b.seq, group=cl.all, border=NA)
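One detail of the assignment step is worth checking on a toy example: `max.col` breaks ties at random by default, so a sequence that is equally close to two medoids may land in a different cluster from run to run. The sketch below uses a small made-up distance matrix (`dc_toy` is hypothetical, not part of the biofam example) and passes `ties.method = "first"` to make the assignment deterministic.

```r
# Toy distance-to-medoid matrix: 3 sequences (rows), 2 medoids (columns)
dc_toy <- matrix(c(1, 5,
                   4, 2,
                   3, 3), ncol = 2, byrow = TRUE)

# Negating the distances turns "smallest distance" into "largest value";
# on ties (third row) the first medoid is chosen
cl_toy <- max.col(-dc_toy, ties.method = "first")
cl_toy
```

Whether random or deterministic tie-breaking is preferable depends on the analysis; the point is only that the default is random.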
Source: https://stackoverflow.com/questions/62888071/too-many-unique-sequences