I have been trying to cluster multiple datasets of URLs (around 1 million each) to find the original form of each URL along with its typo variants. I decided to use the Levenshtein distance as a similarity metric.
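As a minimal illustration of why edit distance fits this problem (using the `leven` package from the answer below; the URLs are made up):

from leven import levenshtein

# Small edit distances flag likely typo variants of the same URL.
print(levenshtein("example.com/login", "examples.com/login"))  # 1: one inserted character
print(levenshtein("example.com/login", "exmaple.com/login"))   # 2: a transposition counts as two edits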
From the scikit-learn FAQ, you can do this by defining a custom metric that operates on indices into your data:
from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices into the data list
    return levenshtein(data[i], data[j])

# Cluster on the indices; the metric looks up the actual strings.
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
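Applied to the URL problem, a minimal sketch looks like this (the `urls` list, the `url_lev_metric` name, and `eps=3` are my own placeholders; `eps` and `min_samples` would need tuning on real data):

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

# Hypothetical sample; in practice this would be your ~1M-URL dataset.
urls = ["example.com/home", "exampel.com/home", "example.com/hoem",
        "another-site.org/page"]

def url_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # indices into the urls list
    return levenshtein(urls[i], urls[j])

X = np.arange(len(urls)).reshape(-1, 1)
core_samples, labels = dbscan(X, metric=url_lev_metric, eps=3, min_samples=2)
print(labels)  # e.g. [ 0  0  0 -1]: same label = same cluster, -1 = noise

The second return value assigns each URL a cluster label (-1 means noise), so typo variants of one URL end up sharing a label. Be aware that the metric here is a plain Python callable evaluated pairwise, so at the ~1 million-URL scale from the question this will be slow; treat it as a sketch of the approach rather than a tuned solution.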