Clustering with NA values in R

后端未结


I was surprised to find out that clara from library(cluster) accepts NAs, but the function's documentation says nothing about how it handles these values.

3 Answers

执念已碎

2021-02-13 20:26

Looking at the C code of clara, I noticed that when observations contain missing values, the sum of squares is "reduced" in proportion to the number of missing values, which I think is wrong. Line 646 of clara.c reads " dsum *= (nobs / pp) ": it counts the number of non-missing values in each pair of observations (nobs), divides it by the number of variables (pp), and multiplies the sum of squares by this ratio. I think it should be done the other way round, i.e. " dsum *= (pp / nobs) ".
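
To make the difference between the two scalings concrete, here is a minimal Python sketch. It is not the clara.c code: the function name `scaled_sum_of_squares`, the `scale_up` flag, and the use of NaN to encode NA are assumptions made for illustration only.

```python
import math

def scaled_sum_of_squares(x, y, scale_up):
    """Sum of squared differences over the coordinates observed in both
    x and y, rescaled by nobs/pp (scale_up=False) or pp/nobs (scale_up=True)."""
    pp = len(x)  # total number of variables
    shared = [(a, b) for a, b in zip(x, y)
              if not (math.isnan(a) or math.isnan(b))]
    nobs = len(shared)  # variables observed in both rows
    if nobs == 0:
        return math.nan  # no shared coordinate: dissimilarity is undefined
    dsum = sum((a - b) ** 2 for a, b in shared)
    factor = pp / nobs if scale_up else nobs / pp
    return dsum * factor

x = [1.0, 2.0, math.nan, 4.0]
y = [2.0, 4.0, 6.0, math.nan]
print(scaled_sum_of_squares(x, y, scale_up=False))  # 5 * 2/4 = 2.5
print(scaled_sum_of_squares(x, y, scale_up=True))   # 5 * 4/2 = 10.0
```

With nobs / pp, pairs with more missing values look closer together than they otherwise would; with pp / nobs, the partial sum is extrapolated to what it would be over all pp variables.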

野趣味

2021-02-13 20:36
I'm not sure whether kmeans can handle missing data simply by ignoring the missing values in a row.

There are two steps in k-means:
1. Calculating the distance between an observation and the current cluster mean.
2. Updating the cluster mean based on the newly calculated distances.

When observations have missing data, step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy in the cluster package do. But step 2 can only be performed if we have some value for each column of an observation. Imputation might therefore be the next best option for dealing with missing data in kmeans.
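
As a rough illustration of the imputation route, the following Python sketch fills each missing entry with the mean of the observed values in its column before any clustering is run. The helper name `impute_column_means` and the NaN encoding of NA are assumptions for this example; column-mean imputation is only one of several possible strategies, and not necessarily the best one for a given data set.

```python
import math

def impute_column_means(rows):
    """Replace each NaN by the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Assumes every column has at least one observed value.
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, math.nan], [3.0, 4.0], [math.nan, 8.0]]
completed = impute_column_means(data)
print(completed)  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The completed rows have a value in every column, so both the distance step and the mean-update step of k-means are well defined on them.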
生来不讨喜

2021-02-13 20:39
Although it is not stated explicitly, I believe that NAs are handled in the manner described on the ?daisy help page. The Details section has:

In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

Given that clara() will be using the same code internally, this is how I understand NAs in the data to be handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is used, for example, in the definition of Gower's generalised similarity coefficient.
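
A small Python sketch of the Gower-style idea, for numeric variables only: each variable contributes a range-scaled absolute difference, variables missing in either row get weight zero, and the result is averaged over the variables that were actually used. The function name, the `ranges` argument, and the NaN encoding of NA are assumptions for illustration; this is not the code used by daisy() or clara().

```python
import math

def gower_style_dissimilarity(x, y, ranges):
    """Average range-scaled absolute difference over the variables
    observed in both x and y (numeric variables only)."""
    total = 0.0
    used = 0
    for a, b, r in zip(x, y, ranges):
        if math.isnan(a) or math.isnan(b):
            continue  # missing in either row: variable is skipped
        total += abs(a - b) / r
        used += 1
    return total / used if used else math.nan

# Only the first variable is observed in both rows: |1 - 3| / 4
print(gower_style_dissimilarity([1.0, math.nan, 10.0],
                                [3.0, 5.0, math.nan],
                                [4.0, 10.0, 20.0]))  # 0.5
```

Because the sum is divided by the number of variables used rather than by the total number of variables, a pair with missing values is not automatically pulled closer together.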

Update: The C sources in clara.c clearly indicate that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):
```
    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
```