How do I convert between a measure of similarity and a measure of difference (distance)?

悲&欢浪女 2021-02-02 01:57

Is there a general way to convert between a measure of similarity and a measure of distance?

Consider a similarity measure like the number of 2-grams that two strings ha

8 Answers
  • 2021-02-02 02:32

    Cosine similarity is widely used for vectors of n-gram counts or TF-IDF weights.

    from math import pi, acos

    def similarity(x, y):
        # x and y map each n-gram to its count (e.g. a collections.Counter)
        dot = sum(x[k] * y[k] for k in x if k in y)
        return dot / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
    

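    According to Wikipedia, cosine similarity can be converted into a formal distance metric, the angular distance 2 * acos(similarity) / pi, which obeys the properties you would expect of a distance (symmetry, nonnegativity, the triangle inequality). Note that the function below returns 1 minus that value, so despite its name it computes the angular similarity: 1 for identical vectors and 0 for vectors with nothing in common.

    def distance_metric(x, y):
        # Clamp to 1.0 so rounding error cannot push acos() out of its domain.
        # As written this is 1 - angular distance, i.e. the angular similarity.
        return 1 - 2 * acos(min(1.0, similarity(x, y))) / pi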
    

    For non-negative vectors such as n-gram counts, both of these values range between 0 and 1.
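    As a rough illustration of turning the similarity into a distance, here is a minimal sketch of the angular distance for character 2-gram counts. The names `cosine_similarity`, `angular_distance` and `bigrams` are made up for this example and are not part of the answer above; it assumes non-negative count vectors.

```python
from collections import Counter
from math import acos, pi, sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    dot = sum(x[k] * y[k] for k in x if k in y)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    # Assumption for this sketch: an empty vector is similar to nothing.
    return dot / norm if norm else 0.0


def angular_distance(x, y):
    """Normalized angle: 0 for parallel vectors, 1 for orthogonal ones."""
    # Clamp so floating-point rounding cannot push acos() out of its domain.
    s = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return 2 * acos(s) / pi


def bigrams(s):
    """Count the character 2-grams of a string (hypothetical helper)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


if __name__ == "__main__":
    print(angular_distance(bigrams("night"), bigrams("nacht")))
    print(angular_distance(bigrams("night"), bigrams("knight")))
```

    Unlike 1 - cosine similarity, this angle-based value satisfies the triangle inequality, which is what makes it usable where a true metric is required.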

    If you have a tokenizer that produces N-grams from a string you could use these metrics like this:

    >>> from my_tokenizer import Tokenizer  # hypothetical module; substitute your own n-gram tokenizer
    >>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))
    
    >>> from collections import Counter
    >>> list(tokenizer('Hello World again and again?'))
    ['world', 'again', 'again', 'world again', 'again again']
    >>> Counter(tokenizer('Hello World again and again?'))
    Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
    >>> x = _
    >>> Counter(tokenizer('Hi world once again.'))
    Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
    >>> y = _
    >>> sum(x[k]*y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
    0.42857142857142855
    >>> distance_metric(x, y)
    0.28196592805724774
    

    I found the elegant inner product of Counter in this SO answer

  • 2021-02-02 02:34

    In the case of Levenshtein distance, you could increase the similarity score by 1 every time the sequences match, that is, 1 for every position where you didn't need a deletion, insertion or substitution. That way the score is a linear measure of how many characters the two strings have in common.
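    One possible reading of this idea, sketched below, is to run the usual Levenshtein dynamic program while also counting the matched characters along an optimal alignment. The function name `levenshtein_with_matches` and the tie-breaking rule (prefer the alignment with the most matches among those with the fewest edits) are assumptions of this sketch, not something the answer specifies.

```python
def levenshtein_with_matches(a, b):
    """Return (edit_distance, matches) for an optimal alignment of a and b.

    Each cell stores (distance, -matches), so min() first minimizes the
    number of edits and then, among ties, maximizes the number of matches.
    """
    prev = [(j, 0) for j in range(len(b) + 1)]  # aligning "" with b[:j]
    for i, ca in enumerate(a, 1):
        cur = [(i, 0)]  # aligning a[:i] with ""
        for j, cb in enumerate(b, 1):
            d_del, m_del = prev[j]       # delete ca
            d_ins, m_ins = cur[j - 1]    # insert cb
            d_diag, m_diag = prev[j - 1]
            if ca == cb:
                diag = (d_diag, m_diag - 1)   # match: no edit, one more match
            else:
                diag = (d_diag + 1, m_diag)   # substitution
            cur.append(min((d_del + 1, m_del), (d_ins + 1, m_ins), diag))
        prev = cur
    distance, neg_matches = prev[-1]
    return distance, -neg_matches


if __name__ == "__main__":
    print(levenshtein_with_matches("kitten", "sitting"))
```

    The second value plays the role of the similarity score described above, while the first is the ordinary edit distance, so both come out of a single pass.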
