Need help in understanding Rolling Hash computation in constant time for Rabin-Karp Implementation

情书的邮戳 2021-02-14 01:28

I've been trying to implement the Rabin-Karp algorithm in Java. I'm having a hard time computing the rolling hash value in constant time. I've found one implementation at http://algs4.c

2 Answers
  • 2021-02-14 01:38

    This is the "rolling" aspect of the hash. It eliminates the contribution of the oldest character (txt.charAt(i-M)) and incorporates the contribution of the newest character (txt.charAt(i)).

    The hash function is defined as:

                M-1
    hash[i] = ( SUM { input[i-j] * R^j } ) % Q
                j=0
    

    (where I'm using ^ to denote "to the power of".)

    But this can be rewritten as a recurrence that needs only constant work per step:

    hash[i] = (hash[i-1]*R - input[i-M]*(R^M) + input[i]) % Q
    

    Your reference code does exactly this, but it uses a few techniques to ensure that the result is always computed correctly (and efficiently).

    For instance, the + Q in the first expression has no mathematical effect, but it ensures that the intermediate result is never negative (in Java, % applied to a negative value yields a negative result, which is not the residue we want). The code also breaks the calculation into stages, presumably to prevent numerical overflow.
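A minimal sketch of that staged update, checked against a from-scratch hash of every window. The class name, method names, and the values of R and Q are assumptions made for illustration, not taken from the reference code. Note that this version subtracts the oldest character before multiplying by R, so it uses the weight R^(M-1) rather than R^M; the two orderings are algebraically equivalent.

```java
// Illustrative sketch only: R, Q and all names here are assumptions.
class RollingHashSketch {
    static final long R = 256;              // alphabet size (assumed)
    static final long Q = 1_000_000_007L;   // a large prime modulus (assumed)

    // Direct definition: hash of s[start .. start+m-1] via Horner's rule, O(m).
    static long directHash(String s, int start, int m) {
        long h = 0;
        for (int k = 0; k < m; k++) {
            h = (h * R + s.charAt(start + k)) % Q;
        }
        return h;
    }

    // R^(m-1) % Q: the weight of the oldest character in a window of length m.
    static long leadingWeight(int m) {
        long rm = 1;
        for (int k = 1; k < m; k++) {
            rm = (rm * R) % Q;
        }
        return rm;
    }

    // O(1) update: drop the oldest character, shift, append the newest.
    static long roll(long hash, char oldest, char newest, long rm) {
        hash = (hash + Q - rm * oldest % Q) % Q; // + Q keeps the value non-negative
        hash = (hash * R + newest) % Q;
        return hash;
    }

    // True if every rolled hash equals the directly computed hash of the same window.
    static boolean rollingMatchesDirect(String txt, int m) {
        long rm = leadingWeight(m);
        long h = directHash(txt, 0, m);
        for (int i = m; i < txt.length(); i++) {
            h = roll(h, txt.charAt(i - m), txt.charAt(i), rm);
            if (h != directHash(txt, i - m + 1, m)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rollingMatchesDirect("the quick brown fox", 5)); // prints true
    }
}
```

Every intermediate product stays far below the range of long here, because both factors are reduced mod Q before they are multiplied.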

  • 2021-02-14 01:47

    First you need to understand how the hash is computed.

    Let's take the simple case of base-10 strings. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and there we have no collisions:

    "523" = 5*10^2 + 2*10^1 + 3*10^0 = 523

    Using the above hash function, you are guaranteed to get distinct hashes for distinct strings of the same length.

    Given the hash of "523", if you want to calculate the hash of "238", i.e. by dropping the leftmost digit 5 and bringing in a new digit 8 from the right, you would do the following:

    1) remove the effect of the 5 from the hash: hash = hash - 5*10^2 (523-500 = 23)

    2) adjust the hash of the remaining chars by shifting by 1, hash = hash * 10

    3) add the hash of the new character: hash = hash + 8 (230 + 8 = 238, which as we expected is the base 10 hash of "238")
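Steps 1-3 can be sketched in a few lines of Java. The class and method names are hypothetical, and no modulus is applied yet, so this only works for short digit strings.

```java
// Illustrative sketch of steps 1-3 for base-10 digit strings; names are assumptions.
class Base10Roll {
    // The hash of a digit string is simply its numeric value.
    static long hash(String digits) {
        long h = 0;
        for (int k = 0; k < digits.length(); k++) {
            h = h * 10 + (digits.charAt(k) - '0');
        }
        return h;
    }

    // Slide a window of length m one position to the right.
    static long roll(long hash, int outgoing, int incoming, int m) {
        long weight = 1;
        for (int k = 1; k < m; k++) {
            weight *= 10;                 // 10^(m-1), the weight of the leftmost digit
        }
        hash = hash - outgoing * weight;  // 1) remove the leftmost digit
        hash = hash * 10;                 // 2) shift the remaining digits
        hash = hash + incoming;           // 3) add the new digit
        return hash;
    }

    public static void main(String[] args) {
        long h = hash("523");
        System.out.println(h);                // prints 523
        System.out.println(roll(h, 5, 8, 3)); // prints 238
    }
}
```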

    Now let's extend this to all ASCII characters. This takes us to the base-256 world. Keeping the digit values 5, 2, 3 for simplicity (the actual character codes of '5', '2', '3' are 53, 50, 51), the hash of the same string "523" is now

    5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.

    You can imagine that as the string length increases, you will exceed the range of int/long in most programming languages relatively quickly.

    How can we solve this? The routine solution is to do all the arithmetic modulo a large prime number. The drawback is that different strings can now share a hash, so we get false positives, which is a small price to pay if it takes the expected runtime of your algorithm from quadratic to linear!

    The complicated equation you quoted is nothing but steps 1-3 above done with modular arithmetic. The two modulus properties used are:

    a) (a*b) % p = ((a % p) * (b % p)) % p

    b) a % p = (a + p) % p
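A small sketch of how the two properties look in Java, assuming non-negative inputs. The helper names mulMod and subMod are hypothetical. It also shows why property b matters in practice: Java's % keeps the sign of the dividend, so a plain subtraction can leave a negative value.

```java
// Illustrative sketch; helper names are assumptions and inputs are assumed non-negative.
class ModProperties {
    // Property a: multiply the residues instead of the full values.
    static long mulMod(long a, long b, long p) {
        return ((a % p) * (b % p)) % p;
    }

    // Subtraction that stays in [0, p): the + p is property b in action.
    static long subMod(long a, long b, long p) {
        return (a % p + p - b % p) % p;
    }

    public static void main(String[] args) {
        long p = 997;
        System.out.println(mulMod(123456, 654321, p) == (123456L * 654321L) % p); // prints true
        System.out.println((3 - 5) % p);     // prints -2: the sign follows the dividend
        System.out.println(subMod(3, 5, p)); // prints 995
    }
}
```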

    Let's go back to steps 1-3 mentioned above:

    1) (expanded using property a) hash = hash - ((5 % p)*(10^2 %p) %p)

    vs. what you quoted

    txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

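    Here is how the two are related:

    • RM = 10^2 % p (in general, R^(M-1) % Q, the weight of the leftmost character)
    • txt.charAt(i-M) is the outgoing character, 5 here, so RM*txt.charAt(i-M) % Q corresponds to (5 * 10^2) % p
    • The additional + Q you see is just to ensure that the hash is not negative. See property b above.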

    2 & 3) hash = hash*10 + 8 vs. txtHash = (txtHash*R + txt.charAt(i)) % Q; these are the same, except that the result is reduced mod Q.

    Looking at properties a & b more closely should help you figure it out!
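Putting the pieces together, here is a hedged sketch of a complete search loop built around the two quoted update lines. The class name, the constants R and Q, and the choice to confirm each hash match with a direct string comparison (to rule out false positives) are assumptions for illustration, not the reference implementation.

```java
// Illustrative sketch only: names, R and Q are assumptions.
class RabinKarpSketch {
    static final long R = 256;              // alphabet size (assumed)
    static final long Q = 1_000_000_007L;   // a large prime modulus (assumed)

    // Hash of the first m characters of s, computed with Horner's rule.
    static long hash(String s, int m) {
        long h = 0;
        for (int k = 0; k < m; k++) {
            h = (h * R + s.charAt(k)) % Q;
        }
        return h;
    }

    // Returns the index of the first occurrence of pat in txt, or -1 if there is none.
    static int search(String pat, String txt) {
        int m = pat.length();
        int n = txt.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long rm = 1;                         // R^(m-1) % Q
        for (int k = 1; k < m; k++) {
            rm = (rm * R) % Q;
        }

        long patHash = hash(pat, m);
        long txtHash = hash(txt, m);
        if (patHash == txtHash && txt.startsWith(pat)) return 0;

        for (int i = m; i < n; i++) {
            // Step 1: remove the leading character (+ Q avoids a negative value).
            txtHash = (txtHash + Q - rm * txt.charAt(i - m) % Q) % Q;
            // Steps 2 and 3: shift and add the trailing character.
            txtHash = (txtHash * R + txt.charAt(i)) % Q;

            int offset = i - m + 1;
            // Equal hashes may be a false positive, so confirm with a real comparison.
            if (patHash == txtHash && txt.startsWith(pat, offset)) return offset;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("needle", "haystack with a needle inside")); // prints 16
        System.out.println(search("xyz", "haystack"));                         // prints -1
    }
}
```

The explicit comparison costs O(M) per hash match but guarantees a correct answer; skipping it trades that guarantee for speed.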
