ALGORITHM - String similarity score/hash

后端未结

关注

 8  1303

Is there a method to calculate something like general \"similarity score\" of a string? In a way that I am not comparing two strings together but rather I get some number/scores

相关标签:

8条回答

不要未来只要你来

2021-02-01 10:30
There are several such "scores", but they all depend on how you define similarity.
- I think the python library already has a soundex implementation.
- you can also compute the Levenshtein distance of two strings
- NYSIIS?
0 讨论(0)
发布评论:

提交评论
- 加载中...
半阙折子戏

2021-02-01 10:30

I don't know if you are still into this, but in information theory there is a way to measure how much information a string or chunk of text has, maybe you could use that value as a hash in order to sort your strings. It is called entropy, and wikipedia has a nice article about it: https://en.wikipedia.org/wiki/Entropy_(information_theory)

0 讨论(0)
发布评论:

提交评论
- 加载中...
旧时难觅i

2021-02-01 10:31

For a fast way of determining string similarity, you might want to use fuzzy hashing.

0 讨论(0)
发布评论:

提交评论
- 加载中...
你的背包

2021-02-01 10:32

TL;DR: Python BK-tree

Interesting question. I have limited experience within this field, but since the Levenshtein distance fulfills the triangle inequality, I figured that there must be a way of computing some sort of absolute distance to an origin in order to find strings in the vicinity of each other without performing direct comparisons against all entries in the entire database.

While googling on some terms related to this, I found one particularly interesting thesis: Aspects of Metric Spaces in Computation by Matthew Adam Skala.

At page 26 he discusses similarity measures based on kd-trees and others, but concludes:

However, general metric spaces do not provide the geometry required by those techniques. For a general metric space with no other assumptions, it is necessary distance-based to use a distance-based approach that indexes points solely on the basis of their distance from each other. Burkhard and Keller [35] offered one of the first such index structures, now known as a BK-tree for their initials, in 1973. In a BK-tree, the metric is assumed to have a few discrete return values, each internal node contains a vantage point, and the subtrees correspond to the different values of the metric.

A blog entry about how BK-trees work can be found here.

In the thesis, Skala goes on describing other solutions to this problem, including VP-trees and GH-trees. Chapter 6 analyses distances based on the Levenshtein edit distance. He also presents some other interesting distance metrics for strings.

I also found Foundations of Multidimensional and Metric Data Structures, which seems relevant to your question.

0 讨论(0)
发布评论:

提交评论
- 加载中...
轻奢々

2021-02-01 10:34
You may be interested in Hamming Distance. The Python function hamming_distance() computes the Hamming distance between two strings.
```
def hamming_distance(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
我在风中等你

2021-02-01 10:37

You might want to look at using a BK-Tree. Here is a discussion and python implementation.

A BK-Tree stores strings in a tree, sorted by Levenshtein distance to the parent nodes. This is normally used to prune the search space when looking for similar strings, but it seems that this tree would form a natural ordering that could be used to create clusters.

0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页