Question
I am working on a clustering problem, and it has to scale to a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods.
I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information.
If anyone has some insight about it, I would be very grateful. Thank you.
Answer 1:
The Bisecting K-Means Approach
It seems to do a decent job and runs quite fast. Here is some sample code I wrote for using the bisecting k-means algorithm in Spark (Scala) to get cluster centers from the Iris data set (which many people are familiar with). Note: I use Spark-Notebook for most of my Spark work; it is very similar to Jupyter Notebooks. I bring this up because you will need to create a Spark SQLContext for this example to work, which may differ based on where or how you are accessing Spark.
You can download the Iris.csv to test here
You can download Spark-Notebook here
It is a great tool that easily lets you run a standalone Spark cluster. If you want help with it on Linux or Mac, I can provide instructions. Once you download it, you need to use SBT to compile it: from the base directory, start sbt, then issue run.
It will be accessible at localhost:9000
Required Imports
import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.BisectingKMeans
Method to create sqlContext in Spark-Notebook
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
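If you are running Spark 2.x or later outside Spark-Notebook, the usual entry point is a SparkSession rather than a SQLContext. A minimal sketch, assuming a local standalone run (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext as the entry point in Spark 2.x+
val spark = SparkSession.builder()
  .appName("BisectingKMeansExample") // hypothetical name
  .master("local[*]")                // assumption: local standalone run
  .getOrCreate()

// spark.read can then be used wherever sqlContext.read appears below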
Defining Import Schema
val customSchema = StructType(Array(
StructField("c0", IntegerType, true),
StructField("Sepal_Length", DoubleType, true),
StructField("Sepal_Width", DoubleType, true),
StructField("Petal_Length", DoubleType, true),
StructField("Petal_Width", DoubleType, true),
StructField("Species", StringType, true)))
Making the DF
val iris_df = sqlContext.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.schema(customSchema)
.load("/your/path/to/iris.csv")
Specifying features
val assembler = new VectorAssembler()
  .setInputCols(Array("c0", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")
val iris_df_trans = assembler.transform(iris_df)
Model with 3 Clusters (change with .setK)
val bkm = new BisectingKMeans().setK(3).setSeed(1L).setFeaturesCol("features")
val model = bkm.fit(iris_df_trans)
Computing cost
val cost = model.computeCost(iris_df_trans)
println(s"Within Set Sum of Squared Errors = $cost")
Calculating Centers
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)
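If you also want each row's cluster assignment, the fitted model can transform the DataFrame. A hedged sketch reusing iris_df_trans from above (note that computeCost was deprecated in later Spark releases; on Spark 2.3+ the silhouette score from ClusteringEvaluator is an alternative quality measure):

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Adds a "prediction" column holding each row's cluster index
val predictions = model.transform(iris_df_trans)
predictions.select("Species", "prediction").show(10)

// Silhouette score as an alternative to the deprecated computeCost
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
println(s"Silhouette = ${evaluator.evaluate(predictions)}")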
An Agglomerative Approach
The following provides an agglomerative hierarchical clustering implementation for Spark. It is not included in base MLlib the way the bisecting k-means method is, and I do not have an example for it, but it is worth a look for those curious:
Github Project
YouTube of Presentation at Spark-Summit
Slides from Spark-Summit
Answer 2:
The only thing I was able to find is divisive hierarchical clustering, implemented in Spark ML via bisecting k-means (here: https://spark.apache.org/docs/latest/mllib-clustering.html#bisecting-k-means). I am planning to give it a try.
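For reference, the page linked above documents the RDD-based spark.mllib variant of the algorithm. A minimal sketch, assuming an existing SparkContext sc and purely numeric input (the data values are made up):

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy data: each point is an mllib Vector
val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3)))

val bkm = new BisectingKMeans().setK(2)
val model = bkm.run(data)
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster $idx center: $center")
}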
Have you found/tried anything?
Source: https://stackoverflow.com/questions/44152337/hierarchical-agglomerative-clustering-in-spark