How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

礼貌的吻别 2021-01-18 08:05

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g.

2 Answers
  • 2021-01-18 08:45

    Have you considered implementing the search yourself?

    A for loop over the candidate values is not particularly hard to write, and nesting two loops to optimize two parameters is still fairly easy.
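    As a rough illustration of such a loop, here is a minimal sketch for DBSCAN. The synthetic `make_blobs` data, the candidate values, and the use of the silhouette score as a label-free criterion are all assumptions made for this example; none of them comes from the question, and `manual_dbscan_search` is a hypothetical helper name.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def manual_dbscan_search(X, eps_values, min_samples_values):
    """Fit DBSCAN for every (eps, min_samples) pair; return results, best first."""
    results = []
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

        # DBSCAN labels noise points as -1; score only the clustered points
        mask = labels != -1
        n_clusters = len(set(labels[mask]))

        # silhouette_score needs at least 2 clusters and fewer clusters than points
        if n_clusters < 2 or n_clusters >= mask.sum():
            continue

        results.append({
            "eps": eps,
            "min_samples": min_samples,
            "n_clusters": n_clusters,
            "n_noise": int((~mask).sum()),
            "silhouette": float(silhouette_score(X[mask], labels[mask])),
        })
    return sorted(results, key=lambda r: r["silhouette"], reverse=True)


# Assumed toy data: three well-separated blobs, standardized
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

results = manual_dbscan_search(X, eps_values=[0.2, 0.3, 0.5],
                               min_samples_values=[3, 5, 10])
for row in results[:3]:
    print(row)
```

    Scoring only the non-noise points favours settings that discard many points as noise, so the `n_noise` column is worth reading alongside the score.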

    For both DBSCAN and MeanShift, however, I advise first understanding your similarity measure. It makes more sense to choose the parameters from an understanding of that measure than to optimize them to match some labels, which carries a high risk of overfitting.

    In other words, at which distance are two articles supposed to be clustered?

    If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function so that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. It may work much worse in a clustering context.
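    One way to get a feel for the measure is to look at the distances it actually produces before choosing any threshold. The sketch below assumes TF-IDF vectors with cosine distance and uses a made-up four-document corpus; `nearest_neighbor_distances` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances


def nearest_neighbor_distances(docs):
    """Return, for each document, the cosine distance to its closest other document."""
    tfidf = TfidfVectorizer().fit_transform(docs)
    dist = pairwise_distances(tfidf, metric="cosine")
    # Ignore the zero distance of each document to itself
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)


# Hypothetical toy corpus, for illustration only
docs = [
    "the central bank raised interest rates again",
    "interest rates were raised by the central bank",
    "the football team won the championship final",
    "a late goal decided the championship final",
]

nearest = nearest_neighbor_distances(docs)
for doc, d in zip(docs, nearest):
    print(f"{d:.3f}  {doc}")
```

    If the distances between documents you consider related differ a lot from one pair to the next, a single eps or bandwidth is unlikely to suit all of them.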

    Also beware that MeanShift, like k-means, has to recompute coordinates. On text data this may yield undesired results, where the updated coordinates actually get worse instead of better.
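    One possible reading of this, shown here as plain arithmetic rather than as a description of MeanShift's internals: averaging sparse, unit-length document vectors gives a point that is denser and no longer unit length, so it stops looking like a document. The three one-term vectors are an assumed toy input.

```python
import numpy as np

# Three toy "documents", each a unit-length vector with a single non-zero term
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# The recomputed coordinate: the mean of the three vectors
mean = docs.mean(axis=0)
mean_norm = np.linalg.norm(mean)

print("non-zero terms per document:", np.count_nonzero(docs, axis=1))
print("non-zero terms in the mean: ", np.count_nonzero(mean))
print("length of the mean vector:  ", round(float(mean_norm), 3))
```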

  • 2021-01-18 09:01

    The following function for DBSCAN might help. I've written it to iterate over the hyperparameters eps and min_samples, and included optional arguments for the minimum and maximum number of clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.

    def dbscan_grid_search(X_data, lst, clst_count, eps_space=(0.5,),
                           min_samples_space=(5,), min_clust=0, max_clust=10):
        """
        Performs a hyperparameter grid search for DBSCAN.

        Parameters:
            * X_data            = data used to fit the DBSCAN instance
            * lst               = a list to store the results of the grid search
            * clst_count        = a list to store the per-cluster point counts of each stored result (label -1 is noise)
            * eps_space         = an iterable of values to try for the eps parameter
            * min_samples_space = an iterable of values to try for the min_samples parameter
            * min_clust         = the minimum number of clusters a search iteration must produce for its result to be appended to lst
            * max_clust         = the maximum number of clusters a search iteration may produce for its result to be appended to lst

        Example:

        # Loading libraries
        import numpy as np
        from sklearn import datasets
        from sklearn.preprocessing import StandardScaler

        # Loading iris dataset
        iris = datasets.load_iris()
        X = iris.data

        # Scaling X data
        dbscan_scaler = StandardScaler()
        dbscan_X_scaled = dbscan_scaler.fit_transform(X)

        # Setting empty lists in global environment
        dbscan_clusters = []
        cluster_count   = []

        # Inputting function parameters
        dbscan_grid_search(X_data = dbscan_X_scaled,
                           lst = dbscan_clusters,
                           clst_count = cluster_count,
                           eps_space = np.arange(0.1, 5, 0.1),
                           min_samples_space = np.arange(1, 50, 1),
                           min_clust = 3,
                           max_clust = 6)
        """
    
        # Importing Counter to count the points in each cluster, and DBSCAN itself
        from collections import Counter
        from sklearn.cluster import DBSCAN
    
    
        # Starting a tally of total iterations
        n_iterations = 0
    
    
        # Looping over each combination of hyperparameters
        for eps_val in eps_space:
            for samples_val in min_samples_space:
    
                dbscan_grid = DBSCAN(eps = eps_val,
                                     min_samples = samples_val)
    
    
                # Fitting the model and predicting the cluster labels
                clusters = dbscan_grid.fit_predict(X = X_data)
    
    
                # Counting the amount of data in each cluster
                cluster_count = Counter(clusters)
    
    
                # Saving the number of clusters, excluding the noise label (-1)
                n_clusters = len(set(clusters) - {-1})
    
    
                # Increasing the iteration tally with each run of the loop
                n_iterations += 1
    
    
                # Appending to lst each time the n_clusters criteria are met
                if min_clust <= n_clusters <= max_clust:

                    lst.append([eps_val,
                                samples_val,
                                n_clusters])
    
    
                    clst_count.append(cluster_count)
    
        # Printing grid search summary information
        print(f"Search complete.\nYour list is now of length {len(lst)}.")
        print(f"Hyperparameter combinations checked: {n_iterations}.\n")
    