I am working on a clustering problem that needs to scale to a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods.
I have done some research on the web about using hierarchical clustering with Spark, but haven't found any promising information.
If anyone has some insight about it, I would be very grateful. Thank you.
The Bisecting K-Means Approach
It seems to do a decent job and runs quite fast. Here is sample code I wrote for using the BisectingKMeans algorithm in Spark (Scala) to get cluster centers from the Iris dataset (which many people are familiar with). Note: I use Spark-Notebook for most of my Spark work; it is very similar to Jupyter Notebooks. I bring this up because you will need to create a Spark SQLContext for this example to work, which may differ based on where or how you are accessing Spark.
You can download the Iris.csv to test here
You can download Spark-Notebook here
It is a great tool that easily lets you run a standalone Spark cluster. If you want help with it on Linux or Mac, I can provide instructions. Once you download it, you need to compile it with SBT. Use the following commands from the base directory:
sbt
then:
run
It will be accessible at localhost:9000
Required Imports
import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.BisectingKMeans
Method to create sqlContext in Spark-Notebook
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
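If you are on Spark 2.x or later, the SparkSession entry point can be used instead. A minimal sketch (the appName here is just a placeholder; in Spark-Notebook a SparkContext sc is usually predefined):
import org.apache.spark.sql.SparkSession
// SparkSession (Spark 2.x+) wraps SQLContext; getOrCreate() reuses an existing session.
val spark = SparkSession.builder().appName("BisectingKMeansIris").getOrCreate()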
Defining Import Schema
val customSchema = StructType(Array(
  StructField("c0", IntegerType, true),
  StructField("Sepal_Length", DoubleType, true),
  StructField("Sepal_Width", DoubleType, true),
  StructField("Petal_Length", DoubleType, true),
  StructField("Petal_Width", DoubleType, true),
  StructField("Species", StringType, true)))
Making the DF
val iris_df = sqlContext.read
  .format("csv")
  .option("header", "true") // read the header row
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load("/your/path/to/iris.csv")
Specifying features
val assembler = new VectorAssembler()
  .setInputCols(Array("c0", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")
// Note: c0 is just a row index; you may want to drop it from the input columns.
val iris_df_trans = assembler.transform(iris_df)
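To confirm the assembler produced the expected vector column, you can peek at it without truncation:
iris_df_trans.select("features").show(3, false)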
Model with 3 Clusters (change with .setK)
val bkm = new BisectingKMeans().setK(3).setSeed(1L).setFeaturesCol("features")
val model = bkm.fit(iris_df_trans)
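To get the actual cluster assignment for each row, apply the fitted model back to the assembled DataFrame; transform adds a "prediction" column by default. A small sketch:
// Compare assigned cluster indices with the known species labels.
val predictions = model.transform(iris_df_trans)
predictions.select("Species", "prediction").show(10)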
Computing cost
val cost = model.computeCost(iris_df_trans)
println(s"Within Set Sum of Squared Errors = $cost")
Calculating Centers
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)
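On newer Spark versions (2.3+), a silhouette score via ClusteringEvaluator is an alternative quality measure; computeCost has since been deprecated in its favor. A sketch reusing the predictions DataFrame from above:
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// Silhouette ranges from -1 to 1; values near 1 indicate well-separated clusters.
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")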
An Agglomerative Approach
The following provides an agglomerative hierarchical clustering implementation for Spark that is worth a look. It is not included in base MLlib like the bisecting k-means method, and I do not have an example for it, but it is worth checking out for those curious.
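In the meantime, for those curious about what the agglomerative (bottom-up) strategy actually does, here is a small single-machine sketch in plain Scala using single linkage. It is not a distributed Spark implementation, and the helper names (euclidean, agglomerate) are my own; it just shows the merge loop on a handful of points, e.g. cluster centers collected to the driver:
// Euclidean distance between two points.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Repeatedly merge the two closest clusters (single linkage) until targetK remain.
def agglomerate(points: Array[Array[Double]], targetK: Int): Seq[Set[Int]] = {
  var clusters: Seq[Set[Int]] = points.indices.map(Set(_))
  while (clusters.size > targetK) {
    // Distance between every pair of clusters: minimum point-to-point distance.
    val pairs = for {
      i <- clusters.indices
      j <- (i + 1) until clusters.size
    } yield (i, j, (for (a <- clusters(i); b <- clusters(j))
      yield euclidean(points(a), points(b))).min)
    val (i, j, _) = pairs.minBy(_._3)
    // Merge the closest pair and keep the rest unchanged.
    val merged = clusters(i) ++ clusters(j)
    clusters = clusters.zipWithIndex.collect { case (c, k) if k != i && k != j => c } :+ merged
  }
  clusters
}
For example, agglomerate(centers.map(_.toArray), 2) would merge the three bisecting k-means centers above down to two clusters.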
The only thing I was able to find is divisive hierarchical clustering, implemented in Spark MLlib via bisecting k-means (see https://spark.apache.org/docs/latest/mllib-clustering.html#bisecting-k-means). I am planning to give it a try.
Have you found/tried anything?
Source: https://stackoverflow.com/questions/44152337/hierarchical-agglomerative-clustering-in-spark