Scikit K-means clustering performance measure

自闭症患者 2021-02-04 08:42

I'm trying to do a clustering with the K-means method, but I would like to measure the performance of my clustering. I'm not an expert, but I am eager to learn more about clustering.

3 answers
  •  余生分开走
    2021-02-04 09:06

    Apart from the silhouette score, the elbow criterion can be used to evaluate K-means clustering. Scikit-learn does not provide it as a ready-made function or method, so we need to calculate the SSE ourselves for a range of k values.

    The idea of the elbow criterion is to choose the k (number of clusters) after which the SSE stops dropping sharply, i.e. the point where the curve bends. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.
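    To make the SSE definition concrete, here is a minimal sketch that computes it by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The choice of k=3, `n_init=10` and `random_state=0` is only an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# k=3 is an arbitrary choice for this illustration
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid assigned to each sample, shape (n_samples, n_features)
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# SSE: sum over all samples of the squared Euclidean distance to its centroid
manual_sse = float(np.sum((X - assigned_centers) ** 2))

print("manual SSE:", manual_sse)
print("inertia_  :", kmeans.inertia_)
```

    The two printed values should agree up to floating-point rounding, which is why `inertia_` can stand in for the SSE.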

    Calculate the sum of squared errors (SSE) for each value of k, where k is the number of clusters, and plot the result as a line graph. SSE tends to decrease toward 0 as k increases (SSE = 0 when k equals the number of data points in the dataset, because then each data point is its own cluster and there is no error between it and the center of its cluster).
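    The limiting case can be checked on a tiny example. This is only a sketch: the six points below are made-up toy data, chosen to be distinct so that k can go all the way up to the number of points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six distinct 2-D points (made-up toy data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])

sse_by_k = {}
for k in range(1, len(points) + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    sse_by_k[k] = km.inertia_
    print(f"k={k}: SSE={km.inertia_:.4f}")
```

    With k equal to the number of points, every point sits on its own centroid, so the last printed SSE should be zero up to rounding.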

    So the goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.

    Iris dataset example:

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    iris = load_iris()
    X = pd.DataFrame(iris.data, columns=iris['feature_names'])
    # Cluster on three of the four features
    data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

    sse = {}
    for k in range(1, 10):
        # Fit on the feature columns only. Do not write kmeans.labels_ back into
        # `data`, or the labels would become an extra feature in the next fit.
        # n_init and random_state are set explicitly for reproducible results.
        kmeans = KMeans(n_clusters=k, max_iter=1000, n_init=10, random_state=0).fit(data)
        # inertia_: sum of squared distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_
    plt.figure()
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()
    

    If the line graph looks like an arm, the "elbow" on the arm (marked with a red circle in the original plot) is the optimal value of k (number of clusters). According to the elbow in the line graph above, the optimal number of clusters is 3.
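    Reading the elbow off a plot is subjective, so it can help to look at the numbers as well. The sketch below is one possible aid, not a standard scikit-learn feature: it prints how much the SSE drops each time k is increased by one, using the same three iris features as the example above. The `drops` dictionary is a name introduced here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# First three iris columns: sepal length, sepal width, petal length
X = load_iris().data[:, :3]

sse = {}
for k in range(1, 10):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Drop in SSE gained by going from k-1 to k clusters
drops = {k: sse[k - 1] - sse[k] for k in range(2, 10)}
for k, drop in drops.items():
    print(f"k={k}: SSE={sse[k]:.2f}, drop from k-1={drop:.2f}")
```

    The elbow is roughly where the drops become small compared with the earlier ones; where exactly to draw that line remains a judgment call.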

    Note: The elbow criterion is a heuristic and may not work for your data set. Follow your intuition about the dataset and the problem you are trying to solve.
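    Since the elbow is only a heuristic, the silhouette score mentioned at the start can serve as a cross-check. A minimal sketch using `sklearn.metrics.silhouette_score`, again on the first three iris features; the range of k and the `random_state` are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# First three iris columns: sepal length, sepal width, petal length
X = load_iris().data[:, :3]

scores = {}
for k in range(2, 10):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

    The silhouette score lies between -1 and 1, with higher values indicating better separated clusters. It does not have to agree with the elbow, so treat both as hints rather than a final answer.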

    Hope it helps!
