Machine Learning Reading Notes (11): Clustering Analysis


K-means

Some basic terms

  1. centroid: the center of mass of a cluster (the average of its points)
  2. medoid: the most representative or most frequently occurring point of a cluster

Steps

  1. Randomly pick $k$ centroids from the sample points as initial cluster centers.
  2. Assign each sample to the nearest centroid $\mu^{(j)}, j \in \{1, \dots, k\}$.
  3. Move each centroid to the center of the samples that were assigned to it.
  4. Repeat steps 2 and 3 until the cluster assignments no longer change, a user-defined tolerance is reached, or a maximum number of iterations is reached (see the sketch below).
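A minimal NumPy sketch of these four steps (my own illustrative implementation, not from the book; the function name simple_kmeans and the convergence check are assumptions, and scikit-learn's KMeans, used further below, should be preferred in practice):

import numpy as np

def simple_kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.RandomState(seed)
    # step 1: randomly pick k sample points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each sample to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of the samples assigned to it
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centroids barely move (user-defined tolerance)
        if np.allclose(centroids, new_centroids, atol=tol):
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids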

SSE

Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia.
$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \|x - y\|_2^2$

$SSE = \sum_{i=1}^{n}\sum_{j=1}^{k} w^{(i,j)} \|x^{(i)} - \mu^{(j)}\|_2^2$
Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in cluster $j$, $w^{(i,j)} = 0$ otherwise.
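As a quick sanity check (assuming X, y_km, and the fitted km object from the scikit-learn example further below), the SSE defined above can be recomputed by hand and compared with the value scikit-learn exposes as km.inertia_:

import numpy as np

# within-cluster SSE computed directly from the formula above
sse = sum(np.sum((X[y_km == j] - km.cluster_centers_[j]) ** 2)
          for j in range(km.n_clusters))
print(sse, km.inertia_)  # the two numbers should agree up to floating-point error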

K-means++

Place the initial centroids far away from each other via the k-means++ algorithm.

Steps

  1. Initialize an empty set $M$ to store the $k$ centroids being selected.
  2. Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to $M$.
  3. For each sample $x^{(i)}$ that is not in $M$, find the minimum squared distance $d(x^{(i)}, M)^2$ to any of the centroids in $M$.
  4. To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to $\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$.
  5. Repeat steps 3 and 4 until $k$ centroids are chosen.
  6. Proceed with the classic k-means algorithm (a sketch of this seeding procedure follows below).
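A rough NumPy sketch of the seeding procedure (my own illustration; the function name kmeanspp_init is an assumption — scikit-learn already performs this internally when init='k-means++' is passed to KMeans):

import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.RandomState(seed)
    # step 2: choose the first centroid uniformly at random from the samples
    centroids = [X[rng.randint(len(X))]]
    while len(centroids) < k:
        # step 3: squared distance of every sample to its closest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # step 4: pick the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)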

Example

from sklearn.datasets import make_blobs
# y is not actually used here, since we want an unsupervised model; it can still be used to validate the clustering result afterwards
X, y = make_blobs(n_samples=150, 
                  n_features=2, 
                  centers=3, 
                  cluster_std=0.5, 
                  shuffle=True, 
                  random_state=0)
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], 
            c='white', marker='o', edgecolor='black', s=50)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()

[Figure: scatter plot of the 150 generated sample points]

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, 
            init='k-means++', 
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)
plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='cluster 3')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()

[Figure: the three clusters found by k-means and their centroids]
The above algorithm is an example of hard clustering, where each sample is assigned to exactly one cluster.

Soft clustering (FCM)

Soft clustering is also called fuzzy clustering; a prominent example is fuzzy C-means (FCM).
Instead of a hard cluster assignment, each point gets a probability of belonging to each cluster.

Steps

  1. Specify the number of centroids, $k$, and randomly assign the cluster memberships for each point.
  2. Compute the cluster centroids $\mu^{(j)}, j \in \{1, \dots, k\}$.
  3. Update the cluster memberships for each point.
  4. Repeat steps 2 and 3 until the membership coefficients do not change, a user-defined tolerance is reached, or a maximum number of iterations is reached.
    The objective being minimized is
    $J_m = \sum_{i=1}^{n}\sum_{j=1}^{k} w^{m(i,j)} \|x^{(i)} - \mu^{(j)}\|_2^2, \quad m \in [1, \infty)$
    Here, the membership indicator $w^{(i,j)}$ is not a binary value as in k-means but a real value that denotes the cluster membership probability (and $w^{m(i,j)}$ denotes this membership raised to the power $m$).
    The exponent $m$ is the so-called fuzziness coefficient (or simply fuzzifier) that controls the degree of fuzziness: the larger the value of $m$, the fuzzier the clusters.
    $w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\|x^{(i)} - \mu^{(j)}\|_2}{\|x^{(i)} - \mu^{(p)}\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$
    The center $\mu^{(j)}$ of a cluster is calculated as the mean of all samples weighted by their degree of membership in that cluster:
    $\mu^{(j)} = \frac{\sum_{i=1}^{n} w^{m(i,j)} x^{(i)}}{\sum_{i=1}^{n} w^{m(i,j)}}$
    (A rough implementation sketch follows after this list.)
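scikit-learn does not include an FCM implementation, so the following is only a rough NumPy sketch of the two update equations above (the function name, the random initialization, and the stopping rule are my own assumptions):

import numpy as np

def fuzzy_cmeans(X, k, m=2.0, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.RandomState(seed)
    # step 1: random membership matrix w, each row sums to 1
    w = rng.rand(len(X), k)
    w /= w.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        wm = w ** m
        # step 2: centroids as membership-weighted means of the samples
        centroids = (wm.T @ X) / wm.sum(axis=0)[:, None]
        # step 3: update memberships from the distances to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        w_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # step 4: stop once the memberships no longer change noticeably
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    return w, centroids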

Elbow method

Plot the distortion (the within-cluster SSE, exposed by scikit-learn as km.inertia_) for a range of values of k and choose the k at which the distortion stops decreasing sharply, i.e. the "elbow" of the curve:

distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, 
                init='k-means++', 
                n_init=10, 
                max_iter=300, 
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('images/11_03.png', dpi=300)
plt.show()

[Figure: distortion versus number of clusters (elbow plot)]

Silhouette plots

Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are.

Steps

  1. Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $x^{(i)}$ and all other points in the same cluster.
  2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance between the sample $x^{(i)}$ and all samples in the nearest cluster.
  3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:
    $s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}$
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

km = KMeans(n_clusters=3, 
            init='k-means++', 
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    # select the silhouette values of the samples in this cluster and sort them
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    # plot the horizontal bars for this cluster
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, 
             edgecolor='none', color=color)
    # record the y-axis tick position (middle of this cluster's bars)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
# draw a vertical line at the average silhouette coefficient
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--") 

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()

[Figure: silhouette plot for k = 3 with the average silhouette coefficient shown as a dashed line]
The higher the silhouette coefficient s, the better separated the clusters are; s lies in the range [-1, 1]. The silhouette plot can serve as a finishing evaluation step after the model has been built.
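If a single summary number is enough, the average silhouette coefficient is also available directly (assuming X and y_km from the example above):

from sklearn.metrics import silhouette_score

# mean silhouette coefficient over all samples; values closer to 1 are better
print(silhouette_score(X, y_km, metric='euclidean'))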

Hierarchical clustering

Advantages:

  1. It allows us to plot dendrograms (visualizations of a binary hierarchical clustering).
  2. We do not need to specify the number of clusters upfront.

Approaches:

  1. agglomerative hierarchical clustering (bottom-up)
  2. divisive hierarchical clustering (top-down)

Two standard algorithms for agglomerative hierarchical clustering (compared in the sketch below):

  1. single linkage
  2. complete linkage

[Figure: single linkage vs. complete linkage]
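Both criteria are available through the method argument of SciPy's linkage function; a short sketch comparing them (assuming the small df DataFrame constructed in the example further below):

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# single linkage: merge the two clusters whose *most similar* members are closest
single_clusters = linkage(pdist(df, metric='euclidean'), method='single')

# complete linkage: merge the two clusters whose *most dissimilar* members are closest
complete_clusters = linkage(pdist(df, metric='euclidean'), method='complete')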

Steps

  1. Compute the distance matrix of all samples
  2. Represent each data point as a singleton cluster
  3. Merge the two closest clusters based on the distance of the most dissimilar members
  4. Update the similarity matrix
  5. Repeat steps 3 and 4 until only one single cluster remains
import pandas as pd
import numpy as np

np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df

[Output: the 5×3 random DataFrame df]

from scipy.spatial.distance import pdist, squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)
row_dist

[Output: the 5×5 pairwise Euclidean distance matrix row_dist]

# correct approach: pass a condensed distance matrix to linkage
from scipy.cluster.hierarchy import linkage

row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1) 
                    for i in range(row_clusters.shape[0])])
from scipy.cluster.hierarchy import dendrogram

# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters, 
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300, 
#            bbox_inches='tight')
plt.show()

[Figure: dendrogram of the complete-linkage clustering of the five samples]

dendrogram with heatmap

# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])

# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

#plt.savefig('images/11_12.png', dpi=300)
plt.show()

[Figure: row dendrogram combined with a heatmap of the reordered samples]

sklearn

from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(n_clusters=3, 
                             affinity='euclidean', 
                             linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)
-------------------------------------------------
Cluster labels: [1 0 0 2 1]
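Note: in more recent scikit-learn releases the affinity parameter of AgglomerativeClustering has been renamed to metric (deprecated around version 1.2 and later removed, to the best of my knowledge); if the call above raises an error, the equivalent would be:

ac = AgglomerativeClustering(n_clusters=3,
                             metric='euclidean',
                             linkage='complete')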

DBSCAN

DBSCAN: Density-based Spatial Clustering of Applications with Noise
Criteria:

1. core point: a point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius $\varepsilon$.
2. border point: a point that has fewer neighbors than MinPts within $\varepsilon$, but lies within the $\varepsilon$ radius of a core point.
3. noise point: all other points that are neither core nor border points.
[Figure: core, border, and noise points in DBSCAN]

Steps

  1. Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther away than $\varepsilon$).
  2. Assign each border point to the cluster of its corresponding core point.
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=600)
plt.show()

[Figure: scatter plot of the two half-moon shaped sample sets]

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
            edgecolor='black',
            c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
            edgecolor='black',
            c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')

ac = AgglomerativeClustering(n_clusters=2,
                             affinity='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue',
            edgecolor='black',
            marker='o', s=40, label='cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red',
            edgecolor='black',
            marker='s', s=40, label='cluster 2')
ax2.set_title('Agglomerative clustering')

plt.legend()
plt.tight_layout()
# plt.savefig('images/11_15.png', dpi=300)
plt.show()

[Figure: k-means clustering (left) and agglomerative clustering (right) on the half-moon data]

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
            c='lightblue', marker='o', s=40,
            edgecolor='black', 
            label='cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
            c='red', marker='s', s=40,
            edgecolor='black', 
            label='cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()

[Figure: DBSCAN clustering of the half-moon data]
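One detail not visible in the legend above: DBSCAN labels noise points with -1, so y_db may contain values other than 0 and 1. A quick check (assuming y_db from the code above):

import numpy as np

# DBSCAN marks noise points with the label -1; count the samples per label
labels, counts = np.unique(y_db, return_counts=True)
print(dict(zip(labels, counts)))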
Thank you for your precious time and patience!
