I\'ve been trying to implement Rabin-Karp algorithm in Java. I have hard time computing the rolling hash value in constant time. I\'ve found one implementation at http://algs4.c
First you need to understand how the hash is computed.
Lets take a simple case of base 10 strings. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and we don't have collisions!!
"523" = 5*10^2 + 2*10^1 + 3*10^0 = 523
using the above hash function you are guaranteed to get distinct hashes for every string.
Given the hash of "523", if you want to calculate the hash of "238", i.e. by jutting out the leftmost digit 5 and bringing in a new digit 8 from the right, you would have to do the following:
1) remove the effect of the 5 from the hash: hash = hash - 5*10^2 (523-500 = 23)
2) adjust the hash of the remaining chars by shifting by 1, hash = hash * 10
3) add the hash of the new character: hash = hash + 8 (230 + 8 = 238, which as we expected is the base 10 hash of "238")
Now let's extend this to all ascii characters. This takes us to the base 256 world. Therefore the hash of the same string "523" now is
= 5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.
You can imagine as the string length increases you will will exceed the range of integer/long in most programming languages relatively quickly.
How can we solve this? The way this is routinely solved is by working with modulus a large prime number. The drawback of this method is that we will now get false positives as well, which is a small price to pay if it takes the runtime of your algorithm from quadratic to linear!
The complicated equation you quoted is nothing but the steps 1-3 above done with modulus math. The two modulus properties used above are ->
a) (a*b) % p = ((a % p) * (b % p)) % p
b) a % p = (a + p) % p
Lets go back to steps 1-3 mentioned above ->
1) (expanded using property a) hash = hash - ((5 % p)*(10^2 %p) %p)
vs. what you quoted
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
Here are is how the two are related!
- RM = 10^3 % p
- txt.charAt(i-M) % Q = 5 % p
- The additional + Q you see is just to ensure that the hash is not negative. See property b above.
2 & 3) hash = hash*10 + 8, vs txtHash = (txtHash*R + txt.charAt(i)) % Q; Is the same but with taking mod of the final hash result!
Looking at properties a & b more closely, should help you figure it out!