Hash function that hashes similar strings in the same bucket

徘徊边缘 提交于 2020-01-02 14:11:26

问题


I'm searching for a "bad" hash function: I'd like to hash strings and put similar strings in one bucket.

Can you give me a hint where to start my research? Some methods or algorithm names...


回答1:


Your problem is not an easy one. Two ideas:

This solution might be overly complicated but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low frequency part is the general jist of the text and the high frequency part is the tiny changes.

This is somewhat similar to what jpeg compression does: Throw away the details and just leave the important stuff. If you have two almost-identical images and you jpeg compress them greatly, you usually get the same output.

pHash uses a method similar to this.

Again, this is going to be a pretty complicated way to do it.

Second idea: minHash

The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.

For example, count how many times the word "the" appears in the text. If it's even, 0, if it's odd, 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.

Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar then they will have similar outputs strings, differing by just 1 or two bits. You can use a multi-variate partition trie (MVP) to search the outputs efficiently.

This, too, might be overkill for your problem.




回答2:


It depends on what you mean by "similar string" ?

But if you look for such a bad one, you have to build it yourself.

Example :

  • you can create 10 buckets (0 to 9) and group the strings by theirs length mod 10

  • Use a strcmp() like function and group them by the differences with a defined String



来源:https://stackoverflow.com/questions/3422820/hash-function-that-hashes-similar-strings-in-the-same-bucket

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!