Most common substring of length X

前端 未结 8 2150
没有蜡笔的小新
没有蜡笔的小新 2021-02-09 13:49

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.

For example, if s=\"aoaoa\" and X=3

相关标签:
8条回答
  • 2021-02-09 14:37

    I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.

    (I'm imagining, for a minute, that s is a multi-gigabyte string, and that X is some number over a million, and not seeing any simple ways of doing string comparison, or hashing substrings of length X, that are O(1), and not dependent on the size of X)

    It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.

    Update

    Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).

    0 讨论(0)
  • 2021-02-09 14:38

    There's no way to do this in O(n).

    Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.

    0 讨论(0)
提交回复
热议问题