hclust() in R on large datasets

别那么骄傲 2020-12-03 16:30

I am trying to implement hierarchical clustering in R with hclust(). This requires a distance matrix created by dist(), but my dataset has around a million rows, and even EC2 inst

1 answer
  • 2020-12-03 17:05

    One possible solution is to sample your data, cluster the smaller sample, then treat the clustered sample as training data for k-nearest neighbors and "classify" the rest of the data. Here is a quick example with 1.1M rows, using a sample of 5,000 points. The full data set is not well separated, but with only 1/220 of the data, the sample is sparse enough that the clusters separate. Since your question referred to hclust, I used that, but you could use other clustering algorithms such as DBSCAN or mean shift.

    ## Generate data
    set.seed(2017)
    x = c(rnorm(250000, 0,0.9), rnorm(350000, 4,1), rnorm(500000, -5,1.1))
    y = c(rnorm(250000, 0,0.9), rnorm(350000, 5.5,1), rnorm(500000,  5,1.1))
    XY = data.frame(x,y)
    Sample5K = sample(length(x), 5000)     ## Downsample
    
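    ## Cluster the sample
    DM5K = dist(XY[Sample5K,])               ## pairwise distances within the sample only
    HC5K = hclust(DM5K, method="single")
    ## Cut into 8 groups, then lump groups 4 to 8 together as group 4;
    ## only groups 1 to 3 are treated as clusters below
    Groups = cutree(HC5K, 8)
    Groups[Groups>4] = 4
    ## Group 4 is drawn opaque (alpha 1) so the leftover points stand out
    plot(XY[Sample5K,], pch=20, col=rainbow(4, alpha=c(0.2,0.2,0.2,1))[Groups])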
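
    Before relying on groups 1 to 3 being the large clusters, it may be worth looking at the group sizes. As far as I know, cutree() numbers groups by the order in which observations first appear, not by size, so with other data the big clusters need not be 1, 2 and 3. The sketch below is only an illustration on a hypothetical scaled-down version of the data (11,000 points, sample of 1,000); SampleIdx, Sizes and CoreLabels are names introduced here, and picking the three largest groups assumes you already expect three real clusters.

    ```r
    ## Hypothetical scaled-down version of the data above (11,000 points, not 1.1M)
    set.seed(2017)
    x = c(rnorm(2500, 0, 0.9), rnorm(3500, 4, 1), rnorm(5000, -5, 1.1))
    y = c(rnorm(2500, 0, 0.9), rnorm(3500, 5.5, 1), rnorm(5000, 5, 1.1))
    XY = data.frame(x, y)
    SampleIdx = sample(nrow(XY), 1000)

    ## Same clustering steps as in the answer, on the smaller sample
    HC = hclust(dist(XY[SampleIdx, ]), method = "single")
    Groups = cutree(HC, 8)

    ## Size of each of the 8 groups; very small groups are likely stragglers
    Sizes = table(Groups)
    print(Sizes)

    ## Labels of the three largest groups (assumption: three real clusters)
    CoreLabels = as.integer(names(sort(Sizes, decreasing = TRUE))[1:3])

    ## Sample points belonging to those three groups
    Core = which(Groups %in% CoreLabels)
    ```

    If the printed sizes show only one or two large groups, the cut level or the linkage method probably needs adjusting before going further.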
    

    Now assign every point, including the rest of the data, to the cluster of its nearest clustered sample point.

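    ## Keep only the sample points that fell in groups 1 to 3
    Core = which(Groups<4)
    library(class)
    ## knn() defaults to k=1: each row of XY gets the group of its nearest core point
    knnClust = knn(XY[Sample5K[Core], ], XY, Groups[Core])
    plot(XY, pch=20, col=rainbow(3, alpha=0.1)[knnClust])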
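
    Because the data here is synthetic, the nearest-neighbor step can be sanity-checked against the component each point was generated from. The sketch below is a rough check on a hypothetical scaled-down data set, not part of the original example: Truth, TrainIdx, Assigned, Confusion and Agreement are names introduced here, and the training labels are taken from Truth as a stand-in for the cutree() groups, so it tests only the kNN assignment, not the clustering.

    ```r
    library(class)   ## provides knn()

    ## Hypothetical scaled-down data with known generating components
    set.seed(2017)
    n = c(2500, 3500, 5000)
    x = c(rnorm(n[1], 0, 0.9), rnorm(n[2], 4, 1), rnorm(n[3], -5, 1.1))
    y = c(rnorm(n[1], 0, 0.9), rnorm(n[2], 5.5, 1), rnorm(n[3], 5, 1.1))
    XY = data.frame(x, y)
    Truth = rep(1:3, times = n)

    ## Stand-in for the clustered sample: 500 points with known labels
    ## (in the answer, the labels come from cutree() instead)
    TrainIdx = sample(nrow(XY), 500)

    ## k = 5 instead of the default k = 1, to smooth the cluster boundaries
    Assigned = knn(XY[TrainIdx, ], XY, factor(Truth[TrainIdx]), k = 5)

    ## Cross-tabulate assigned cluster against generating component
    Confusion = table(Assigned, Truth)
    print(Confusion)

    ## Fraction of points assigned to their generating component
    Agreement = sum(diag(Confusion)) / sum(Confusion)
    print(Agreement)
    ```

    With real data there is no Truth column, so a check like this is only possible on a simulated data set that resembles yours.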
    

    A few quick notes.

    1. Because I created the data, I knew to choose three clusters. With a real problem, you would have to do the work of figuring out an appropriate number of clusters.
    2. Sampling 1/220 could completely miss any small clusters. In the small sample, they would just look like noise.
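
    To put a rough number on the second note: with 5,000 points drawn from 1.1M rows, a cluster of 1,000 points contributes only about 4.5 sample points on average. The sketch below computes this for a few cluster sizes using the hypergeometric distribution. The cluster sizes and the cutoff of 10 points are assumptions chosen for illustration, not values from the data.

    ```r
    ## Numbers from the example above
    N = 1100000   ## total rows
    n = 5000      ## sample size

    ## Hypothetical small-cluster sizes to examine (assumption)
    ClusterSize = c(100, 1000, 10000)

    ## Expected number of sampled points from each cluster
    Expected = n * ClusterSize / N

    ## Probability that fewer than 10 of the cluster's points land in the sample.
    ## phyper(q, m, n, k): m = points in the cluster, n = points outside it, k = draws
    PFewerThan10 = phyper(9, ClusterSize, N - ClusterSize, n)

    print(data.frame(ClusterSize, Expected, PFewerThan10))
    ```

    If the smallest cluster you care about is likely to fall under such a cutoff, a larger sample or a stratified sample may be needed.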