Question
I am using H2O (H2O Flow, in particular) to do K-means clustering. I selected the "standardize" checkbox, which is supposed to standardize columns before computing distances. The model trained fine and I looked at the results, which report "within_cluster_sum_of_squares". My question: is "within_cluster_sum_of_squares" the distance BEFORE or AFTER standardization? I would expect it to be the distance after standardization, but the value I see is large, which makes it look as if it was computed before standardization (I am not sure, though). Any ideas? Thanks.
Answer 1:
When you select standardize for K-Means in Flow, it does standardize the columns before computing the distances.
So, to answer your question: "within_cluster_sum_of_squares" is computed after standardization is performed.
One reason the metric may seem too large is that you may have expected the H2O-3 K-Means standardize option to perform normalization (e.g. normalize = x / ||x||) rather than standardization (e.g. standardize = (x - mean) / sd).
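To make the difference concrete, here is a minimal sketch in plain Python (not H2O code). The helper names `standardize` and `normalize` and the data are hypothetical, and it assumes the sample standard deviation is used. It shows that after standardization each column contributes roughly (n - 1) to the total sum of squares, so the metric grows with the number of rows and can look large even on standardized data.

```python
import math
import statistics

def standardize(column):
    """Column-wise (x - mean) / sd, using the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

def normalize(row):
    """Row-wise x / ||x||, scaling a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

# Hypothetical data: 1000 rows, two columns on very different scales.
n_rows = 1000
col_a = [float(i) for i in range(n_rows)]
col_b = [1000.0 * ((i * 37) % 101) for i in range(n_rows)]

std_cols = [standardize(col_a), standardize(col_b)]

# Treat all rows as one cluster. After standardization the centroid is the
# origin, so the within-cluster sum of squares is just the sum of squares.
total_ss = sum(v * v for col in std_cols for v in col)

# Each standardized column contributes (n_rows - 1), regardless of its raw scale.
expected_ss = (n_rows - 1) * len(std_cols)
print(total_ss, expected_ss)

# Normalization, by contrast, gives every row unit length.
unit_row = normalize([3.0, 4.0])
print(unit_row)
```

With more clusters the within-cluster sums shrink, but they still scale with the row count rather than staying near 1.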
From the k-means documentation here is the overview of the standardization option:
standardize: Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.
Note: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers on both the standardized scale (centers_std) and the de-standardized scale (centers) are displayed. To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using h2o.scale in R with center = TRUE and scale = TRUE on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.
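The de-standardization step described in the note can be sketched as follows. This is an illustration in plain Python rather than H2O's implementation; `destandardize_center` and the sample data are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def destandardize_center(center_std, means, sds):
    """Map a center from the standardized scale back to the original scale:
    multiply by the column's standard deviation, then add the column's mean."""
    return [c * sd + m for c, m, sd in zip(center_std, means, sds)]

# Hypothetical data set with two numeric columns.
columns = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
means = [statistics.mean(col) for col in columns]
sds = [statistics.stdev(col) for col in columns]

# A center expressed on the original scale, e.g. the mean of the first two rows.
raw_center = [1.5, 15.0]

# The same center on the standardized scale.
center_std = [(c - m) / sd for c, m, sd in zip(raw_center, means, sds)]

# De-standardizing recovers the original-scale center.
recovered = destandardize_center(center_std, means, sds)
print(center_std, recovered)
```

The centers can be mapped back this way, but the distances (and therefore within_cluster_sum_of_squares) remain on the standardized scale.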
Source: https://stackoverflow.com/questions/54087676/h2o-open-source-for-k-mean-clustering