Too many unique sequences

蹲街弑〆低调 submitted on 2021-02-10 20:17:17

Question


I have a large dataset with more than 2 million sequences, about 180,000 of which are unique. I am using the seqdist command to compute distances, and I ultimately also want to identify clusters of sequences. Below is the error message I get:

Code and error message

Is there any way of setting a different maximum number of sequences, or some other workaround? Thank you very much in advance!


Answer 1:


The size limit for the distance matrix follows from the maximum allowed index value, which is machine dependent.
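To see why roughly 180,000 unique sequences run into this limit, a back-of-the-envelope check in base R can help. This is only a sketch: it assumes the relevant bound is R's `.Machine$integer.max`, which may not be the exact test TraMineR applies, and it uses the approximate count given in the question.

```r
n <- 180000                       # approximate number of unique sequences (from the question)
n_pairs <- n * (n - 1) / 2        # entries in the lower triangle of the distance matrix
n_pairs                           # about 1.62e10
.Machine$integer.max              # largest integer index on this machine (assumed bound)
n_pairs > .Machine$integer.max    # TRUE: the matrix cannot be indexed
n_pairs * 8 / 2^30                # GiB needed at 8 bytes per stored distance
```

Even if the index limit were lifted, storing that many double-precision distances would need far more memory than a typical workstation has.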

For a very large number n of sequences, one solution is to select a random, representative subset of the sequences, compute the dissimilarities for this subset, and cluster the subset.

If cluster membership is needed for each individual sequence, you can identify the medoid of each cluster obtained from the subset and then assign each sequence to the closest medoid. For k clusters, this requires computing only n x k distances instead of the full pairwise matrix.
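The saving can be quantified with a rough count of distance computations. The values of `m` and `k` are illustrative choices, and `n` is the approximate number of unique sequences from the question; this is a sketch of the arithmetic, not a benchmark.

```r
n <- 180000   # unique sequences (approximate, from the question)
m <- 400      # size of the random subset (illustrative choice)
k <- 4        # number of clusters (illustrative choice)

full_cost   <- n * (n - 1) / 2          # all pairwise distances
subset_cost <- m * (m - 1) / 2 + n * k  # pairs within the subset + distances to the k medoids

full_cost / subset_cost                 # factor by which the number of distances shrinks
```

With these values most of the remaining work is the n x k distances to the medoids, so the cost grows linearly rather than quadratically with the number of sequences.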

I illustrate below using the biofam data that ships with TraMineR.

Note that up to version 2.2-0.1, TraMineR tested for the size of the pairwise distance matrix even when refseq was used. This has been fixed in the development version available at https://r-forge.r-project.org/R/?group_id=743.

library(TraMineR)
data(biofam)
b.seq <- seqdef(biofam[, 10:25])

## compute pairwise distances on a random subset
set.seed(1)  # arbitrary seed, only so that the random subset is reproducible
spl <- sample(nrow(b.seq), 400)
bs.seq <- b.seq[spl,]
d.lcs <- seqdist(bs.seq, method="LCS", full.matrix=FALSE)

## cluster the random subset
bs.hclust <- hclust(as.dist(d.lcs), method="ward.D")
#plot(bs.hclust, labels=FALSE)
cl <- cutree(bs.hclust,k=4)

## plot clusters for random subset
seqdplot(bs.seq, group=cl, border=NA)

## Medoids of the clusters
c.cl <- disscenter(d.lcs, group=cl, medoids.index="first")
seqiplot(bs.seq[c.cl,]) # plot of the medoids

## distances from every sequence to each medoid
## (spl[c.cl[i]] is the row of the i-th medoid in the full data)
dc <- matrix(0, nrow=nrow(b.seq), ncol=length(c.cl))
for (i in seq_along(c.cl)) {
  dc[,i] <- seqdist(b.seq, method="LCS", refseq=spl[c.cl[i]])
}

## cluster membership for the full sequence dataset
##  is for each row the column with the smallest distance
cl.all <- max.col(-dc) 

## now we can plot clusters for the whole dataset
seqdplot(b.seq, group=cl.all, border=NA)
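
One detail worth checking in the assignment step: `max.col()` breaks ties at random by default, so a sequence that is equally distant from two medoids may land in a different cluster from run to run. The toy matrix below is made up for illustration and is not TraMineR output. It sketches how `ties.method = "first"` gives a deterministic assignment that matches `which.min()` applied row by row.

```r
# made-up distances of 3 sequences (rows) to 3 medoids (columns)
dc_toy <- matrix(c(2, 0, 5,
                   1, 3, 1,   # tie between medoids 1 and 3
                   4, 4, 0), nrow = 3, byrow = TRUE)

cl_first <- max.col(-dc_toy, ties.method = "first")  # deterministic: first closest medoid
cl_min   <- apply(dc_toy, 1, which.min)              # same rule, written explicitly

all(cl_first == cl_min)
```

Whether ties matter in practice depends on the distance measure. With integer-valued distances such as LCS they can be frequent, so fixing the tie rule (or setting a seed) makes the cluster membership reproducible.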


Source: https://stackoverflow.com/questions/62888071/too-many-unique-sequences
