Optimizing Jaro-Winkler algorithm

孤独总比滥情好 2021-02-05 13:45

I have this code for the Jaro-Winkler algorithm, taken from this website. I need to run it 150,000 times to get the distance between differences. It takes a long time, as I run on an Androi

6 Answers
  • 2021-02-05 14:05

    Yes, this can be made a lot faster. For one thing, you don't need the StringBuffers at all. For another, you don't need a separate loop to count transpositions.

    You can find my implementation here, and it should be a lot faster. It's under Apache 2.0 License.

  • 2021-02-05 14:07

    I don't know much about Android or how it works with databases; WP7 has (will have :) ) SQL CE. The next step would typically be to work with your data: store each string's length and use it to limit your comparisons. Add indexes on both columns and sort by length, then by value. The index on length should be sorted as well. I had this running on an old server with 150,000 medical terms, giving suggestions and spell checking in under 0.5 seconds. Users could barely notice it, especially when it ran on a separate thread.

    I have meant to blog about this for a long time (like 2 years :) ) because there is a need. I finally managed to write a few words about it and provide some tips. Please check it out here:

    ISolvable.blogspot.com

    Although it is written for the Microsoft platform, the general principles are the same.
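
The length-based pruning described above can be sketched in Java as follows. This is a minimal illustration, not the blog's method: `LengthFilter`, `upperBound`, and `candidates` are hypothetical names, and the bound assumes the usual Winkler parameters (prefix capped at 4, scaling factor 0.1). It relies on the fact that at most min(len1, len2) characters can match, so the Jaro score cannot exceed (2 + min/max) / 3.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: skips candidates whose length alone rules out a good score. */
final class LengthFilter {

    /**
     * Upper bound on the Jaro-Winkler similarity of two strings with the
     * given lengths. At most min(len1, len2) characters can match, so the
     * Jaro part is at most (2 + min/max) / 3. The Winkler bonus is bounded
     * assuming a common prefix of at most 4 and a scaling factor of 0.1.
     */
    static double upperBound(int len1, int len2) {
        if (len1 == 0 || len2 == 0) {
            return len1 == len2 ? 1.0 : 0.0;
        }
        double min = Math.min(len1, len2);
        double max = Math.max(len1, len2);
        double jaro = (2.0 + min / max) / 3.0;
        int prefix = (int) Math.min(4, min);
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    /** Returns the terms that could still reach the threshold against the query. */
    static List<String> candidates(String query, List<String> terms, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (upperBound(query.length(), term.length()) >= threshold) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With the terms sorted or indexed by length, the same bound turns into a length range, so the full comparison only runs on the slice of terms inside that range.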

  • 2021-02-05 14:08

    Instead of returning the common characters from the GetCommonCharacters method, use a couple of arrays to keep track of the matches, similar to the C version here: https://github.com/miguelvps/c/blob/master/jarowinkler.c

    /*Calculate matching characters*/
    for (i = 0; i < al; i++) {
        for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
            if (a[i] == s[j] && !sflags[j]) {
                sflags[j] = 1;
                aflags[i] = 1;
                m++;
                break;
            }
        }
    }
    

    Another optimization is to pre-calculate a bitmask for each string. Using that, check whether the current character of the first string is present in the second. This can be done using efficient bitwise operations.

    This skips the max/min calculation and the inner loop for characters that are missing.
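
One way the bitmask idea might look in Java is sketched below. `CharMask`, `maskOf`, and `mayContain` are hypothetical names, and folding each character into 64 bits with `c & 63` is an assumption made here for illustration. Because different characters can share a bit, the check can only prove absence: a `true` result still needs the normal window scan.

```java
/** Hypothetical sketch of a per-string character bitmask. */
final class CharMask {

    /** Folds every character of s into one of 64 bits. */
    static long maskOf(String s) {
        long mask = 0L;
        for (int i = 0; i < s.length(); i++) {
            mask |= 1L << (s.charAt(i) & 63);
        }
        return mask;
    }

    /** False means c is definitely absent; true means it may be present. */
    static boolean mayContain(long mask, char c) {
        return (mask & (1L << (c & 63))) != 0;
    }
}
```

The masks can be computed once per string and stored next to it. If `(maskOf(a) & maskOf(b)) == 0`, the two strings share no characters at all, so there are no matches and the comparison can be skipped entirely.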

  • 2021-02-05 14:09
    1. Try to avoid the two nested loops in the getCommonCharacters method.
      Suggestion as to how: store all the chars of the smaller string in a map of some sort (Java has a few), where the key is the character and the value is the position. That way you can still calculate the distance and check whether they are in common. I don't quite understand the algorithm, but I think this is doable.
    2. Apart from that and bmargulies's answer, I really don't see further optimizations beyond bit-level tricks and the like. If this is really critical, consider rewriting this portion in C?
  • 2021-02-05 14:18

    Yes, but you aren't going to enjoy it. Replace all those StringBuffers created with new by char arrays that are allocated once in the constructor and never again, using integer indices to keep track of what's in them.

    This pending Commons-Lang patch will give you some of the flavor.
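
A minimal sketch of the allocate-once idea follows. It is not the Commons-Lang patch: it swaps the char arrays for two preallocated boolean flag arrays (the approach of the C version linked in another answer), and `ReusableJaroWinkler`, `maxLength`, and the empty-string convention are assumptions made for illustration. The prefix cap of 4 and scaling factor of 0.1 are the usual Winkler parameters.

```java
import java.util.Arrays;

/**
 * Sketch of Jaro-Winkler with scratch arrays allocated once and reused.
 * Not thread-safe: use one instance per thread.
 */
final class ReusableJaroWinkler {
    private static final double PREFIX_SCALE = 0.1; // usual Winkler scaling factor
    private static final int MAX_PREFIX = 4;        // usual Winkler prefix cap

    private final boolean[] flagsA;
    private final boolean[] flagsB;

    /** maxLength (assumed parameter) is the longest input this instance accepts. */
    ReusableJaroWinkler(int maxLength) {
        flagsA = new boolean[maxLength];
        flagsB = new boolean[maxLength];
    }

    double similarity(String a, String b) {
        int la = a.length();
        int lb = b.length();
        if (la > flagsA.length || lb > flagsB.length) {
            throw new IllegalArgumentException("input longer than maxLength");
        }
        if (la == 0 || lb == 0) {
            return la == lb ? 1.0 : 0.0; // convention assumed for empty strings
        }
        // Clear only the part of the scratch arrays this call will touch.
        Arrays.fill(flagsA, 0, la, false);
        Arrays.fill(flagsB, 0, lb, false);

        // Find matching characters within the sliding window.
        int range = Math.max(0, Math.max(la, lb) / 2 - 1);
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int end = Math.min(i + range + 1, lb);
            for (int j = Math.max(i - range, 0); j < end; j++) {
                if (!flagsB[j] && a.charAt(i) == b.charAt(j)) {
                    flagsA[i] = true;
                    flagsB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }

        // Walk both flag arrays in step and count matched characters out of order.
        int halfTranspositions = 0;
        int j = 0;
        for (int i = 0; i < la; i++) {
            if (!flagsA[i]) {
                continue;
            }
            while (!flagsB[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                halfTranspositions++;
            }
            j++;
        }
        double m = matches;
        double jaro = (m / la + m / lb + (m - halfTranspositions / 2) / m) / 3.0;

        // Winkler bonus for a shared prefix.
        int prefix = 0;
        int limit = Math.min(MAX_PREFIX, Math.min(la, lb));
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * PREFIX_SCALE * (1.0 - jaro);
    }
}
```

For the 150,000 comparisons in the question, one instance would be created up front (for example `new ReusableJaroWinkler(64)` if no string exceeds 64 characters) and `similarity` called in the loop, so no per-call buffers are allocated.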

  • 2021-02-05 14:27

    I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When comparing a string against itself, the result is off by 1/|string|. For slightly different strings, the scores also come out too low.

    The solution is to change 'm-1' to 'm' in the inner for-statement within the getCommonCharacters method. The code then works like a charm :)

    See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance as well for some examples.
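
As a quick check against the standard definition (with $m$ matching characters, $t$ transpositions, common prefix length $\ell \le 4$ and the usual scaling factor $p = 0.1$), a string compared with itself must score exactly 1, and the familiar MARTHA/MARHTA pair gives about 0.961:

```latex
d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),
\qquad
d_w = d_j + \ell\, p\, (1 - d_j)

% Identical strings: m = |s_1| = |s_2|, t = 0
d_j = \frac{1}{3}\,(1 + 1 + 1) = 1, \qquad d_w = 1

% MARTHA vs MARHTA: m = 6, t = 1, \ell = 3
d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{5}{6}\right) \approx 0.944,
\qquad
d_w \approx 0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) \approx 0.961
```

An implementation that returns anything below 1 for identical inputs is therefore miscounting matches somewhere.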
