问题
I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But, every time I run it I get different results. I know this is the problem with initiation but I don't know how to fix it. This is my a part of my code that runs on sentences:
vectorizer = TfidfVectorizer(norm='l2',sublinear_tf=True,tokenizer=tokenize,stop_words='english',charset_error="ignore",ngram_range=(1, 5),min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k,eigen_solver='arpack',affinity="nearest_neighbors",assign_labels="discretize")
spectral.fit(X)
Data is a list of sentences. Everytime the code runs, my clustering results differs. How can I get consistent results using Spectral clustering. I also have the same problem with Kmean. This is my code for Kmean:
vectorizer = TfidfVectorizer(sublinear_tf=True,stop_words='english',charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1,verbose=0)
km.fit(X_data)
I appreciate your helps.
回答1:
When using k-means, you want to set the random_state
parameter in KMeans
(see the documentation). Set this to either an int or a RandomState instance.
km = KMeans(n_clusters=number_of_k, init='k-means++',
max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)
This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
I'm not sure about the spectral clustering example though. From the documentation on the random_state
parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg'
and by the K-Means initialization." OP's code doesn't seem to be contained in those cases, though setting the parameter might be worth a shot.
回答2:
As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.
The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.
In my opinion, when the results vary highly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may be not well separated; and other algorithms may yield better results.
回答3:
I had a similar issue, but it's that I wanted the data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in the cluster 0
and all gray images of the original data set were in the cluster 1
. For another data set, I want color images / gray images to be in cluster 0
and cluster 1
as well.
Here is the code I stole from a Kaggler - in addition to set the random_state
to a seed, you use the k-mean model returned by KMeans
for clustering the other data set. This works reasonably well. However, I can't find the official scikit-Learn
document saying that.
# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
from sklearn.cluster import KMeans
seed = 42
def create_color_clusters(img_df, cluster_count = 2, cluster_maker=None):
if cluster_maker is None:
cluster_maker = KMeans(cluster_count, random_state=seed)
cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])
img_df['cluster-id'] = np.argmin(cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]),-1)
return img_df, cluster_maker
# Now K-Mean your images `img_df` to two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same kmean-model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)
However, even setting random_state
to a int seed
cannot ensure the same data will always be grouped in the same order across machines. The same data may be clustered as group 0
on one machine and clustered as group 1
on another machine. But at least with the same K-Means model (cluster_maker
in my code) we make sure data from another distribution will be clustered in the same way as the original data set.
回答4:
Typically when running algorithms with many local minima it's common to take a stochastic approach and run the algorithm many times with different initial states. This will give you multiple results, and the one with the lowest error is usually chosen to be the best result.
When I use K-Means I always run it several times and use the best result.
来源:https://stackoverflow.com/questions/25921762/changes-of-clustering-results-after-each-time-run-in-python-scikit-learn