Fast (< n^2) clustering algorithm

孤城傲影 2021-01-30 00:34

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be

6 answers

深忆病人 (original poster)

2021-01-30 01:06
Consider an approximate nearest neighbor (ANN) algorithm or locality-sensitive hashing (LSH). Neither solves the clustering problem directly, but both can tell you which points are "close" to one another, and by tuning their parameters you control how close "close" is. Both avoid comparing all n^2 pairs of points, which is what makes them fast.

More precisely, LSH provides a hash function h such that, for any two points x and y and a distance metric d,
```
d(x,y) <= R1  =>  P(h(x) = h(y)) >= P1
d(x,y) >= R2  =>  P(h(x) = h(y)) <= P2
```
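where R1 < R2 and P1 > P2. So yes, the guarantee is probabilistic: nearby points usually land in the same bucket, and distant points usually do not. You can postprocess the retrieved data (for example, by checking the actual distances between points that collide) to arrive at true clusters.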
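
To make the idea concrete, here is a minimal sketch of one common LSH family for Euclidean distance, the random-projection hash h(x) = floor((a . x + b) / w), followed by the postprocessing step. It is not the E2LSH implementation; the function names (`make_hash`, `build_buckets`, `close_pairs`) and the parameter values (`n_projections=3`, `width=0.5`, `radius=0.1`) are assumptions chosen for illustration, and the random 5-dimensional points stand in for real data.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def make_hash(dim, n_projections, width, rng):
    """Return h(x): a tuple of floor((a . x + b) / width) values.

    Illustrative parameters: each a is drawn from N(0, 1)^dim and each
    offset b from U[0, width). More projections make buckets smaller.
    """
    planes = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(n_projections)
    ]

    def h(x):
        return tuple(
            math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)
            for a, b in planes
        )

    return h

def build_buckets(points, h):
    """Group point indices by hash key: one pass over the data."""
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[h(p)].append(i)
    return buckets

def close_pairs(points, buckets, radius):
    """Postprocess: keep colliding pairs whose true distance is <= radius."""
    pairs = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if math.dist(points[i], points[j]) <= radius:
                pairs.add((i, j))
    return pairs

rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [[rng.random() for _ in range(5)] for _ in range(2000)]
h = make_hash(dim=5, n_projections=3, width=0.5, rng=rng)
buckets = build_buckets(points, h)
pairs = close_pairs(points, buckets, radius=0.1)
```

Only points that share a bucket are ever compared, so the cost depends on bucket sizes rather than on n^2. A single hash table will miss some close pairs that happen to fall on opposite sides of a bucket boundary; repeating the process with several independently drawn hash functions and taking the union of the resulting pairs is the usual way to reduce that.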

More information on LSH is available, including the E2LSH manual. ANN is similar in spirit; David Mount has published information on it, or try FLANN (which has Matlab and Python bindings).