Why does String.indexOf() not use KMP?

前端 未结 2 579
一整个雨季
一整个雨季 2020-12-24 14:30

I read the source code of java.lang.String and I was surprised to find that String.indexof() does not use the Knuth–Morris–Pratt algorithm? As we

相关标签:
2条回答
  • 2020-12-24 14:54

    KMP and several other asymptotically efficient string search methods like Boyer-Moore and Boyer-Moore-Horspool require extra memory -- in the case of KMP, O(m) memory, where m is the size of the substring being searched for. Although this is often acceptable, library designers have to make tradeoffs so that their code performs acceptably well in many different situations. Probably the main reason is that due to both the preprocessing required by KMP, and its more complex inner loop in the search phase, the constant factor slowdown may make it several times slower than the naive O(mn) substring search in many common cases (e.g. searching for a substring of < 10 characters in a long string). Also, someone searching for a large substring might be perplexed to find the runtime library running out of memory as it tries to allocate a large memory buffer for the KMP fallback function table.

    Perhaps a better question is why O(m+n)-time, O(1)-space algorithms like the Two-Way Algorithm have not yet been adopted by mainstream language runtime libraries. Again, the answer is likely to be the constant-factor slowdown in common cases. Nevertheless in at least one C runtime library implementation, the corresponding strstr() function has been updated to use this algorithm.

    Someone around me told me that for short string KMP is good enough, but if you need performance and you intend to use with large string then is not a good choice.

    Well, that's exactly backwards from my understanding, which is that the naive O(mn) substring search is good enough (and probably the best) for short strings, but will eventually lose out to asymptotically faster O(m+n) algorithms like KMP as the strings become longer.

    0 讨论(0)
  • 2020-12-24 15:05

    KMP has better worst-case performance, but actually requires a little bit of up-front computation (to generate the table of offsets). It also requires an initial memory allocation, which could also impact performance.

    For (presumably) common use-cases of searching in relatively short strings, this might actually end up slower than the primitive implementation.

    This, bundled with the fact that for really huge data sets you will probably be using more specialized data structures than a simple String means that the increased implementation (and possibly runtime) cost is not worth investing.

    Note that this might change in future Java versions, as the actual algorithm is not specified.

    0 讨论(0)
提交回复
热议问题