Too many unique sequences

Question

I have a large dataset with more than 2 million sequences, about 180,000 of which are unique. I am using the seqdist command to compute distances, and I ultimately also want to identify clusters of sequences. Below is the error message I get:

Code and error message

Is there any way of setting a different maximum number of sequences, or some other workaround? Thank you very much in advance!

Answer 1:

The size limit for the distance matrix follows from the maximum allowed index value. This value is machine dependent.
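To get a feel for the scale involved, here is a rough back-of-the-envelope check in base R. It assumes the relevant bound is of the order of .Machine$integer.max (the exact test that seqdist applies is not shown here) and that each distance is stored as an 8-byte double.

```r
n <- 180000                         # unique sequences reported in the question
n_pairs <- n * (n - 1) / 2          # entries in the lower triangle of the matrix

## Assumption: the index limit is of the order of .Machine$integer.max
exceeds_index <- n_pairs > .Machine$integer.max
exceeds_index                       # TRUE

## Approximate memory for the lower triangle stored as doubles, in GiB
gib <- n_pairs * 8 / 1024^3
gib                                 # on the order of 120 GiB
```

So even without the index limit, holding the full pairwise matrix in memory would not be realistic on most machines.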

For a very large number n of sequences, a solution is to select a random representative subset of the sequences, compute the dissimilarities for this subset, and cluster the subset.

If cluster membership is needed for each individual sequence, you can identify the medoid of each of the clusters obtained from the subset and then assign each individual sequence to the closest medoid. For k clusters, this requires computing only n x k distances instead of the full pairwise matrix.
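The assignment step can be sketched on a toy example in base R. The distance values below are made up for illustration; in practice each column would hold the distances from all sequences to one medoid.

```r
## Hypothetical distances from 5 sequences (rows) to 3 medoids (columns)
dc <- matrix(c(0, 4, 6,
               1, 3, 5,
               5, 0, 2,
               6, 2, 0,
               3, 3, 7), ncol = 3, byrow = TRUE)

## Closest medoid per row: the column holding the smallest distance.
## ties.method = "first" makes the result reproducible (row 5 is tied).
cl <- max.col(-dc, ties.method = "first")
cl                                  # 1 1 2 3 1

## Same result with which.min, which also returns the first minimum
cl2 <- apply(dc, 1, which.min)
```

Note that max.col breaks ties at random by default, so setting ties.method explicitly is a design choice worth making when results need to be repeatable.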

I illustrate below using the biofam data that ships with TraMineR.

Note that up to version 2.2-0.1, TraMineR tested for the size of the pairwise distance matrix even when refseq was used. This has been fixed in the development version available at https://r-forge.r-project.org/R/?group_id=743.

library(TraMineR)
data(biofam)
b.seq <- seqdef(biofam[, 10:25])

## compute pairwise distances on a random subset
set.seed(1)  # arbitrary seed, only to make the random subset reproducible
spl <- sample(nrow(b.seq), 400)
bs.seq <- b.seq[spl,]
d.lcs <- seqdist(bs.seq, method="LCS", full.matrix=FALSE)

## cluster the random subset
bs.hclust <- hclust(as.dist(d.lcs), method="ward.D")
#plot(bs.hclust, labels=FALSE)
cl <- cutree(bs.hclust,k=4)

## plot clusters for random subset
seqdplot(bs.seq, group=cl, border=NA)

## Medoids of the clusters
## (c.cl holds row indices within the subset bs.seq)
c.cl <- disscenter(d.lcs, group=cl, medoids.index="first")
seqiplot(bs.seq[c.cl,]) # plot of the medoids

## distances to each medoid
dc <- matrix(0, nrow=nrow(b.seq), ncol=length(c.cl))
for (i in seq_along(c.cl)) {
  ## spl[c.cl[i]] is the row of medoid i in the full data b.seq
  dc[,i] <- seqdist(b.seq, method="LCS", refseq=spl[c.cl[i]])
}

## cluster membership for the full sequence dataset
##  is for each row the column with the smallest distance
##  (ties go to the first medoid rather than to a random one)
cl.all <- max.col(-dc, ties.method="first")

## now we can plot clusters for the whole dataset
seqdplot(b.seq, group=cl.all, border=NA)

Source: https://stackoverflow.com/questions/62888071/too-many-unique-sequences

Tags

traminer