I recently came across an interview question: find all the repeating substrings in a given string, with a minimum length of 2. The algorithm should be efficient.
I don't see how a suffix tree can find all the repeating substrings. For the string "mississippi", the suffix tree looks like this:
Sorry, I see now. "At the end, iterate over each node with count>1 and print its path." So "count" is how many this child node
tree-->|---mississippi                  m..mississippi
       |
       |---i-->|---ssi-->|---ssippi     i .. ississippi
       |       |         |
       |       |         |---ppi        issip, issipp, issippi
       |       |
       |       |---ppi                  ip, ipp, ippi
       |
       |---s-->|---si-->|---ssippi      s .. ssissippi
       |       |        |
       |       |        |---ppi         ssip, ssipp, ssippi
       |       |
       |       |---i-->|---ssippi       si .. sissippi
       |               |
       |               |---ppi          sip, sipp, sippi
       |
       |---p-->|---pi                   p, pp, ppi
               |
               |---i                    p, pi
--- Suffix Tree for "mississippi" ---
If you analyze the output for the string "AAAAAAAAAAAAAA", it contains O(n²) characters in total, so any algorithm that prints every repeating substring is at least O(n²).
To achieve O(n²), just build one tree from all suffixes of s (indices [1..n], [2..n], [3..n], ..., [n..n]). It doesn't matter if one of the strings has no end node of its own; just count how often each node is used.
At the end, iterate over each node with count>1 and print its path.
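The counting idea above could be sketched roughly as follows. This is only an illustration in Python, and it assumes an uncompressed tree with one node per character (a suffix trie), which is the simplest structure where "count how often each node is used" is well defined. The names Node and repeated_substrings are made up for this example, and the trie itself can take O(n²) memory.

```python
class Node:
    """One trie node; count = number of suffixes that pass through it."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def repeated_substrings(s, min_len=2):
    """Return all substrings of s (length >= min_len) occurring more than once."""
    root = Node()
    # Insert every suffix s[start:], bumping the counter of each node visited.
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            child = node.children.get(ch)
            if child is None:
                child = Node()
                node.children[ch] = child
            child.count += 1
            node = child

    # Walk the trie iteratively; a child's count never exceeds its parent's,
    # so branches with count <= 1 can be pruned.
    result = []
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node.children.items():
            if child.count > 1:
                new_path = path + ch
                if len(new_path) >= min_len:
                    result.append(new_path)
                stack.append((child, new_path))
    return sorted(result)


if __name__ == "__main__":
    print(repeated_substrings("mississippi"))
```

A node's path spells a substring, and its count is the number of positions where that substring starts, so count>1 means the substring repeats (overlapping occurrences included).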
This is just a wild idea, but it may be worth a try. It consumes O(N) memory, where N is the length of the input string. The algorithm is not O(N) in time, but maybe it can be optimized.
The idea is that you want to avoid frequent string comparisons. Instead, you keep running hashes of the characters read so far (for example, the sum of their ASCII codes) and compare the hashes. If two hashes are equal, the strings may be equal (this has to be verified afterwards). For example:
ABCAB
A -> (65)
B -> (131, 66)
C -> (198, 133, 67)
A -> (263, 198, 132, 65)
B -> (329, 264, 198, 131, 66)
Because you're only interested in substrings of length 2 or more, you omit the last value in each row (it always corresponds to a single character).
Two values occur more than once: 131 and 198. 131 stands for AB both times and reveals the repeated pair. 198, however, stands for ABC, BCA and CAB, which are different strings and have to be rejected by an explicit comparison.
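A rough Python sketch of this sum-of-codes scheme, including the explicit comparison that rejects false matches, might look like the following. The function name repeated_substrings_by_hash and the seen dictionary are assumptions made for the illustration. Note that this version remembers every earlier span, so it uses more than O(N) memory.

```python
from collections import defaultdict


def repeated_substrings_by_hash(s, min_len=2):
    """Find repeated substrings by comparing additive hashes, then verifying."""
    seen = defaultdict(list)   # hash value -> list of (start, end) spans
    repeated = set()
    sums = []                  # sums[i] = sum of character codes of s[i:end]
    for end, ch in enumerate(s, start=1):
        code = ord(ch)
        # Extend every running sum by the new character and start a new one.
        sums = [v + code for v in sums] + [code]
        for start, h in enumerate(sums):
            if end - start < min_len:
                continue  # skip substrings that are too short
            candidate = s[start:end]
            # Equal hashes only mean "maybe equal": compare the real strings.
            if any(s[a:b] == candidate for a, b in seen[h]):
                repeated.add(candidate)
            seen[h].append((start, end))
    return sorted(repeated)


if __name__ == "__main__":
    print(repeated_substrings_by_hash("ABCAB"))
```

For "ABCAB" the rows of sums match the table above, and only AB survives the comparison step.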
That's only the idea, not the solution itself. The hash function can be extended to take the position of each character in the substring (that is, the order of the sequence) into account. The way hash values are stored can also be changed to improve performance, at the cost of increased memory usage.
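One possible way to make the hash depend on character order is a polynomial hash, where each running value is multiplied by a base before the next character is added. This is a swapped-in technique, not something the text above prescribes, and the base and modulus below are arbitrary choices for the sketch.

```python
def poly_hash(text, base=257, mod=(1 << 61) - 1):
    """Order-sensitive hash: h = (h * base + code) mod `mod` per character."""
    h = 0
    for ch in text:
        h = (h * base + ord(ch)) % mod
    return h


def hash_rows(s, base=257, mod=(1 << 61) - 1):
    """Row k holds the hashes of all substrings ending at position k."""
    rows = []
    row = []
    for ch in s:
        code = ord(ch)
        # Shift every running hash by one position, then add the new character.
        row = [(h * base + code) % mod for h in row] + [code]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in hash_rows("ABCAB"):
        print(row)
```

With this hash, ABC, BCA and CAB no longer share a value, while the two occurrences of AB still do. Collisions remain possible in general, so the final string comparison should stay.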
Hope I helped just a little :)