This is a follow-up question from here. I am trying to implement k-means based on this implementation. It works great, but I would like to replace groupByKey(
You could use an aggregateByKey() (a bit more natural than reduceByKey()) like this to compute newCentroids:
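val newCentroids = closest.aggregateByKey((Vector.zeros(dim), 0L))(
  // seqOp: fold one point into the running (sum, count) for its key
  (agg, v) => (agg._1 += v, agg._2 + 1L),
  // combOp: merge two partial (sum, count) pairs for the same key
  (agg1, agg2) => (agg1._1 += agg2._1, agg1._2 + agg2._2)
).mapValues(agg => agg._1 / agg._2).collectAsMap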
For this to work you will need to compute the dimensionality of your data, i.e. dim, but you only need to do this once. You could probably use something like val dim = data.first._2.length.
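If it helps to see what the two functions passed to aggregateByKey() are doing, here is a minimal sketch in plain Scala, with no Spark involved. Everything in it is an assumption made for illustration: the object name AggregateByKeySketch, the Point and Acc aliases, and the idea of modelling partitions as nested Seqs are all hypothetical, and points are plain Seq[Double] rather than Spark's Vector. It only mimics the semantics (a per-partition fold with seqOp, then a merge of the partial results with combOp); it is not how Spark implements it.

```scala
object AggregateByKeySketch {
  // Hypothetical stand-ins: a point is a Seq[Double], an accumulator is (sum, count).
  type Point = Seq[Double]
  type Acc = (Point, Long)

  // Element-wise sum of two points of equal length.
  def add(a: Point, b: Point): Point = a.zip(b).map { case (x, y) => x + y }

  // seqOp: fold one point into the accumulator for its key.
  def seqOp(acc: Acc, p: Point): Acc = (add(acc._1, p), acc._2 + 1L)

  // combOp: merge two partial accumulators for the same key.
  def combOp(a: Acc, b: Acc): Acc = (add(a._1, b._1), a._2 + b._2)

  // Mimics aggregateByKey: each inner Seq plays the role of one partition.
  def aggregateByKey(partitions: Seq[Seq[(Int, Point)]], zero: Acc): Map[Int, Acc] = {
    // Step 1: within each partition, fold every point into a per-key accumulator.
    val partials: Seq[Map[Int, Acc]] = partitions.map { part =>
      part.foldLeft(Map.empty[Int, Acc]) { case (m, (k, p)) =>
        m.updated(k, seqOp(m.getOrElse(k, zero), p))
      }
    }
    // Step 2: only the small per-key partial results are merged across partitions.
    partials.flatten.groupBy(_._1).map { case (k, accs) =>
      k -> accs.map(_._2).reduce(combOp)
    }
  }

  // Divide each summed point by its count to get the centroid.
  def newCentroids(partitions: Seq[Seq[(Int, Point)]], dim: Int): Map[Int, Point] = {
    val zero: Acc = (Seq.fill(dim)(0.0), 0L)
    aggregateByKey(partitions, zero).map { case (k, (sum, n)) => k -> sum.map(_ / n) }
  }

  def main(args: Array[String]): Unit = {
    // Two "partitions" of (clusterId, point) pairs.
    val closest = Seq(
      Seq(0 -> Seq(1.0, 1.0), 1 -> Seq(10.0, 10.0)),
      Seq(0 -> Seq(3.0, 5.0), 1 -> Seq(12.0, 14.0), 0 -> Seq(2.0, 0.0))
    )
    val dim = closest.head.head._2.length // computed once, as suggested above
    println(newCentroids(closest, dim))
  }
}
```

With this toy input, cluster 0 averages to (2.0, 2.0) and cluster 1 to (11.0, 12.0). The point of the pattern is that only one (sum, count) pair per key per partition has to be combined, whereas groupByKey() would first collect every point for a key in one place.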