I'm learning R and I have to cluster numeric data with a timestamp field. One of the parameters is a time, and since the data is strictly day-night dependent, I want to ta
Here is such a mapping of h to m, where h is the time in hours (and fraction of an hour). Then we try kmeans, and at least in this test it seems to work:
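h <- c(22, 23, 0, 1, 2, 10, 11, 12)   # hours of day
ha <- 2*pi*h/24                        # hours as angles in radians
m <- cbind(x = sin(ha), y = cos(ha))   # points on the unit circle
kmeans(m, 2)$cluster # compute cluster assignments via kmeans
## [1] 2 2 2 2 2 1 1 1
# (the labels 1 and 2 are arbitrary and may be swapped between runs)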
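k-means should be used with squared Euclidean distance, because the cluster mean is exactly the point that minimizes the sum of squared Euclidean distances.
But indeed: projecting your data into a meaningful Euclidean space is an easy way to avoid this kind of problem.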
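To illustrate why the projection helps, here is a minimal sketch that compares squared distances on the raw hours with squared distances after the sin/cos mapping. The helper name embed_hours and the three sample hours are assumptions made for this example only.

```r
# Hypothetical helper: map hours (0-24) to points on the unit circle.
embed_hours <- function(h) {
  ha <- 2 * pi * h / 24
  cbind(x = sin(ha), y = cos(ha))
}

h <- c(23, 1, 11)   # 23:00, 01:00 and 11:00

raw_d2 <- as.matrix(dist(h))^2               # squared distances on raw hours
emb_d2 <- as.matrix(dist(embed_hours(h)))^2  # squared distances after embedding

# On raw hours, 23:00 looks farther from 01:00 than from 11:00.
raw_d2[1, 2] > raw_d2[1, 3]

# After embedding, 23:00 is much closer to 01:00 than to 11:00.
emb_d2[1, 2] < emb_d2[1, 3]
```

Both comparisons evaluate to TRUE: the raw scale breaks at midnight, while the embedded distance depends only on the angle between the two times.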
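However, be aware that your mean will no longer lie on the cylinder (the circle of the time coordinates, extended along your other variables). In many cases, you can just rescale the time coordinates of the mean back onto the cylinder. But they might become 0, for example when the times in a cluster are spread evenly around the clock, and then no meaningful rescaling is possible.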
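As a rough sketch of that rescaling step, restricted to the two time coordinates: the helper names center_to_hour and to_xy, the tolerance, and the sample hours are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: rescale a 2-D mean back onto the unit circle and
# convert it to an hour of day. Returns NA when the mean is numerically 0.
center_to_hour <- function(center, tol = 1e-8) {
  r <- sqrt(sum(center^2))
  if (r < tol) return(NA_real_)   # no direction left: rescaling is meaningless
  u <- center / r                 # rescaled mean, back on the circle
  # The mapping used x = sin and y = cos, so the angle is atan2(x, y).
  (atan2(u[["x"]], u[["y"]]) * 24 / (2 * pi)) %% 24
}

# Hypothetical helper: the same sin/cos mapping as in the answer above.
to_xy <- function(h) cbind(x = sin(2 * pi * h / 24), y = cos(2 * pi * h / 24))

morning <- colMeans(to_xy(c(10, 11, 12)))
center_to_hour(morning)                     # about 11

spread <- colMeans(to_xy(c(0, 6, 12, 18)))  # evenly spread around the clock
center_to_hour(spread)                      # NA: the mean sits at the origin
```

With additional numeric variables, only the x and y columns of each center would be rescaled this way; the other coordinates of the mean can stay as they are.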
The other option is kernel k-means. As your desired distance is Euclidean after a data transformation, you can also "kernelize" this transformation, and use kernel k-means. But in your particular case it may actually be faster to transform your data explicitly. Kernelizing will likely only pay off with much more complex transformations (say, to an infinite-dimensional vector space).
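For this particular transformation the kernel has a simple closed form, since sin(a)sin(b) + cos(a)cos(b) = cos(a - b). The following sketch only checks that identity numerically on the hours from the first answer; the variable names K_explicit and K_kernel are assumptions, and no specific kernel k-means implementation is implied.

```r
h  <- c(22, 23, 0, 1, 2, 10, 11, 12)
ha <- 2 * pi * h / 24
m  <- cbind(x = sin(ha), y = cos(ha))

K_explicit <- m %*% t(m)               # inner products of the transformed points
K_kernel   <- cos(outer(ha, ha, "-"))  # same values, computed from the angles alone

all.equal(K_explicit, K_kernel, check.attributes = FALSE)
```

A kernel k-means routine would work from a matrix like K_kernel without ever forming m, which is why it gains little here, where m has only two columns.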