Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as the metric

慢半拍i 2021-02-04 06:44

I have been trying to cluster multiple datasets of URLs (around 1 million each) to find the original and the typos of each URL. I decided to use the Levenshtein distance as a s

2 answers
  •  一向 (original poster)
    2021-02-04 07:02

    According to the scikit-learn FAQ, you can do this by defining a custom metric:

    from leven import levenshtein
    import numpy as np
    from sklearn.cluster import dbscan

    data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

    def lev_metric(x, y):
        # x and y are one-element rows of X holding indices into data
        i, j = int(x[0]), int(y[0])
        return levenshtein(data[i], data[j])

    # DBSCAN needs numeric input, so cluster the indices instead of the strings
    X = np.arange(len(data)).reshape(-1, 1)
    core_samples, labels = dbscan(X, metric=lev_metric, eps=5, min_samples=2)
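
To make the metric concrete, here is a minimal pure-Python sketch of the edit distance that `lev_metric` returns, applied to the same three strings. It is an illustration only: the `levenshtein` function below is a hypothetical stand-in written for this example, not the implementation in the `leven` package, and the `distances` table is an assumption added to show the pairwise values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programming table."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

# Pairwise distances between all strings (illustrative only).
distances = [[levenshtein(a, b) for b in data] for a in data]

for row in distances:
    print(row)
```

The first two strings differ by one substitution and two appended characters, so their distance is 3, which falls within `eps=5`. A full pairwise table like this grows quadratically with the number of strings, so it is only practical for small inputs, not for a million URLs.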
    
