I have this code for Jaro-Winkler algorithm taken from this website. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Androi
Yes, this can be made a lot faster. For one thing, you don't need the StringBuffers at all. For another, you don't need a separate loop to count transpositions.
You can find my implementation here, and it should be a lot faster. It's under Apache 2.0 License.
I don't know much about Android and how it works with databases. WP7 has (will have :) ) SQL CE. The next step would typically be to work with your data. Add string lengths and limit your comparisons. Add indexes on both columns and sort by length and then by value. The index on length should be sorted as well. I had it run on an old server with 150 000 medical terms giving me suggestions and spell checking in under 0.5 seconds, users could barely notice it, especially if running on a separate thread.
I meant to blog about it for a long time (like 2 years :) ) because there is a need. But I finally manage to write few words about it and provide some tips. Please check it out here:
ISolvable.blogspot.com
Although it is for Microsoft platform, still general principles are the same.
Instead returning the common characters using GetCommonCharacters method, use a couple of arrays to keep the matches, similarly to the C version here https://github.com/miguelvps/c/blob/master/jarowinkler.c
/*Calculate matching characters*/
for (i = 0; i < al; i++) {
for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
if (a[i] == s[j] && !sflags[j]) {
sflags[j] = 1;
aflags[i] = 1;
m++;
break;
}
}
}
Another optimization is to pre-calculate a bitmask for each string. Using that, check if the current character on the first string is present on the second. This can be done using efficient bitwise operations.
This will skip calculating the max/min and looping for missing characters.
Yes, but you aren't going to enjoy it. Replace all those new
ed StringBuffers with char arrays that are allocated in the constructor and never again, using integer indices to keep track of what's in them.
This pending Commons-Lang patch will give you some of the flavor.
I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When comparing a string against itself, the answer turns out to be 1/|string| off. When comparing slightly different values, the values also turn out to be lower.
The solution to this is to adjust 'm-1' to 'm' in the inner for-statement within the getCommonCharacters method. The code then works like a charm :)
See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance as well for some examples.