I was surprised to find that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles these values.
Looking at the clara C code, I noticed that when observations contain missing values, the clara algorithm "reduces" the sum of squares in proportion to the number of missing values, which I think is wrong. Line 646 of clara.c reads "dsum *= (nobs / pp)": it counts the non-missing values in each pair of observations (nobs), divides that count by the number of variables (pp), and multiplies the sum of squares by the result. I think it should be done the other way round, i.e. "dsum *= (pp / nobs)".
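To make the difference between the two scalings concrete, here is a minimal sketch. It is not the clara.c code: the function names are hypothetical, and missing values are coded as NaN purely for simplicity. It sums squared differences over the variables observed in both rows and then applies either scaling.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not taken from clara.c: sums squared differences
 * over the variables observed in both rows (NaN marks a missing value)
 * and stores the number of such variables in *n_used. */
static double sq_dist_observed(const double *a, const double *b,
                               size_t p, size_t *n_used)
{
    double sum = 0.0;
    size_t used = 0;
    for (size_t j = 0; j < p; j++) {
        if (isnan(a[j]) || isnan(b[j]))
            continue;               /* skip pairs with a missing value */
        double d = a[j] - b[j];
        sum += d * d;
        used++;
    }
    *n_used = used;
    return sum;
}

/* Scale the partial sum up to all p variables: sum * (p / n_used). */
static double scaled_up(double sum, size_t p, size_t n_used)
{
    assert(n_used > 0);
    return sum * ((double)p / (double)n_used);
}

/* The opposite scaling, sum * (n_used / p), shrinks the sum instead. */
static double scaled_down(double sum, size_t p, size_t n_used)
{
    assert(p > 0);
    return sum * ((double)n_used / (double)p);
}
```

For the rows (1, 2, NA, 4) and (2, 4, 6, NA), only the first two variables are observed in both, so the partial sum is 5. Scaling up gives 10, as if the two missing variables contributed the average observed squared difference; scaling down gives 2.5, which makes rows with more missing values look closer together.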
I am not sure whether kmeans can handle missing data by ignoring the missing values in a row.
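There are two steps in kmeans:
1. Compute the distance between each observation and the cluster centroids, and assign each observation to the nearest centroid.
2. Update each centroid from the observations assigned to it.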
When we have missing data in our observations:
Step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy in the cluster package do. But Step 2 can only be performed if we have some value for each column of an observation. Therefore imputing might be the next best option for kmeans to deal with missing data.
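As an illustration of the imputation route, here is a minimal sketch of column-mean imputation applied before clustering. This is only one simple imputation scheme, chosen for brevity; the function name and the NaN coding of missing values are assumptions, not part of kmeans or the cluster package.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: replace each NaN in the n-by-p row-major matrix x
 * with the mean of the observed values in the same column.
 * Columns with no missing values, or with no observed values at all,
 * are left unchanged. Returns the number of cells that were imputed. */
static size_t impute_column_means(double *x, size_t n, size_t p)
{
    assert(x != NULL);
    size_t imputed = 0;
    for (size_t j = 0; j < p; j++) {
        double sum = 0.0;
        size_t seen = 0;
        for (size_t i = 0; i < n; i++) {
            double v = x[i * p + j];
            if (!isnan(v)) {
                sum += v;
                seen++;
            }
        }
        if (seen == 0 || seen == n)
            continue;               /* nothing to impute from, or nothing to impute */
        double mean = sum / (double)seen;
        for (size_t i = 0; i < n; i++) {
            if (isnan(x[i * p + j])) {
                x[i * p + j] = mean;
                imputed++;
            }
        }
    }
    return imputed;
}
```

After this pass every observation has a value in each column (unless a whole column was missing), so both the distance step and the centroid update can run unchanged. The price is that imputed cells pull observations toward the column mean.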
Although it is not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section has:
In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.
Given that the same code will be used internally by clara(), that is how I understand NAs in the data to be handled: they just don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is, for example, used in the definition of Gower's generalised similarity coefficient.
Update: The C sources for clara.c clearly indicate that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):
    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
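The fragment above only makes sense inside its surrounding loop, so here is a self-contained sketch of the same skip pattern. It is not the clara.c function: the function name and signature are hypothetical, and the meanings given to has_na, jtmd and valmd are my reading of the quoted snippet (a per-column flag plus a per-column value that codes "missing").

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: sum of squared differences between rows a and b of length p,
 * following the skip pattern of the quoted snippet.
 * has_na  : nonzero if the data contain any missing values (assumed meaning)
 * jtmd[j] : negative if column j contains missing values (assumed meaning)
 * valmd[j]: the value that codes "missing" in column j (assumed meaning)
 * *n_used receives the number of variables that entered the sum. */
static double sq_dist_skip_coded(const double *a, const double *b, size_t p,
                                 int has_na, const int *jtmd,
                                 const double *valmd, size_t *n_used)
{
    assert(a != NULL && b != NULL && n_used != NULL);
    double sum = 0.0;
    size_t used = 0;
    for (size_t j = 0; j < p; j++) {
        if (has_na && jtmd[j] < 0) {
            /* column j has missing values: skip the pair if either is coded missing */
            if (a[j] == valmd[j] || b[j] == valmd[j])
                continue;
        }
        double d = a[j] - b[j];
        sum += d * d;
        used++;
    }
    *n_used = used;
    return sum;
}
```

The column flag is checked first so that the comparison against the missing-value code is only made for columns known to contain missing values. A variable therefore contributes to the dissimilarity only when it is observed in both rows.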