Ukkonen's suffix tree algorithm in plain English

梦谈多话 2020-11-22 05:00

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of

7 answers
  •  攒了一身酷
    2020-11-22 05:29

    Thanks to the well-explained tutorial by @jogojapan, I implemented the algorithm in Python.

    A couple of the minor problems mentioned by @jogojapan turned out to be more subtle than I expected, and they need to be treated very carefully. It took me several days to get my implementation robust enough (I suppose). The problems and solutions are listed below:

    1. End with Remainder > 0: It turns out this situation can also arise during an unfolding step, not just at the end of the entire algorithm. When that happens, we can leave remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start the next step, which either keeps folding or unfolds depending on whether the next char in the original string is on the current path.

    2. Leap Over Nodes: When we follow a suffix link and update the active point, we may find that its active_length component does not fit the new active_node. We then have to move forward to the right place to split an edge or insert a leaf. This is less straightforward than it sounds, because actlength and actedge keep changing along the way; when you have to move back to the root node, actedge and actlength could be wrong because of those moves. We need additional variable(s) to keep that information.
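The "leap over nodes" step can be sketched as a small helper. This is a minimal illustration, not the answerer's code: the `Node` class, the `walk_down` name, and the choice to store the active edge as an index into the text (rather than as a character) are all assumptions made for the example. Keeping an index is one common way to avoid losing information while hopping, since the first character of the next edge can always be read back from the text.

```python
class Node:
    """Tree node; the edge leading into it is labelled text[start:end]."""

    def __init__(self, start=0, end=0):
        self.start = start
        self.end = end
        self.children = {}  # first character of an edge label -> child Node

    def edge_length(self):
        return self.end - self.start


def walk_down(text, node, edge_pos, length):
    """Hop over nodes until the active point lies strictly inside one edge.

    node     -- current active node
    edge_pos -- index into text of the first character of the active edge
    length   -- number of characters still to descend below node

    Returns the adjusted (node, edge_pos, length). A returned length of 0
    means the active point sits exactly on a node.
    """
    while length > 0:
        child = node.children[text[edge_pos]]
        span = child.edge_length()
        if length < span:
            break  # the active point is inside this edge; nothing to hop over
        # The whole edge is consumed: move to the child and shift the edge index.
        node = child
        edge_pos += span
        length -= span
    return node, edge_pos, length
```

For instance, with an edge labelled `ab` below the root, an active point of 3 characters starting at `a` is moved to the node at the end of `ab`, with 1 character left to descend.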

    The other two problems have, to some extent, been pointed out by @managonov:

    3. Split Could Degenerate: When trying to split an edge, you will sometimes find that the split point falls exactly on a node. In that case we only need to add a new leaf to that node, but we should still treat it as a standard edge split operation, which means the suffix links, if there are any, should be maintained correspondingly.

    4. Hidden Suffix Links: There is another special case, caused by problem 1 and problem 2. Sometimes we need to hop over several nodes to reach the right point for a split, and we might overshoot it if we move by comparing the remainder string with the path labels. In that case the suffix link, if there should be one, is unintentionally neglected. This can be avoided by remembering the right point when moving forward. The suffix link should be maintained if the split node already exists, or even if problem 1 happens during an unfolding step.
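To make the special cases listed above concrete, here is a compact, independent sketch of the construction. It is not the answerer's implementation: the names (`Node`, `build_suffix_tree`, `leaf_labels`, `pending`) and the convention that `end=None` marks an open leaf edge are assumptions for illustration. The comments mark where each case shows up.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start   # edge label is text[start:end]
        self.end = end       # None marks an open-ended leaf edge
        self.children = {}   # first character of an edge label -> child Node
        self.link = None     # suffix link; None is treated as "go to root"


def build_suffix_tree(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0   # active point; edge is an index into text
    remainder = 0
    pending = None                    # node still waiting for a suffix link

    def link_from_pending(target):
        nonlocal pending
        if pending is not None:
            pending.link = target
        pending = target

    for pos, ch in enumerate(text):
        pending = None
        remainder += 1
        while remainder > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is None:
                # Degenerate split: the insertion point is exactly a node, so
                # only a leaf is added, but suffix links are still maintained.
                node.children[text[edge]] = Node(pos)
                link_from_pending(node)
            else:
                span = (pos + 1 if child.end is None else child.end) - child.start
                if length >= span:
                    # Leap over a node, keeping the edge index consistent.
                    node, edge, length = child, edge + span, length - span
                    continue
                if text[child.start + length] == ch:
                    # The character is already on the path: stop this step
                    # with remainder > 0, but do not lose a hidden suffix link.
                    length += 1
                    link_from_pending(node)
                    break
                # Ordinary split of the edge at offset `length`.
                split = Node(child.start, child.start + length)
                node.children[text[edge]] = split
                split.children[ch] = Node(pos)
                child.start += length
                split.children[text[child.start]] = child
                link_from_pending(split)
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            else:
                node = node.link if node.link is not None else root
    return root


def leaf_labels(root, text):
    """Return the sorted root-to-leaf path labels stored in the tree."""
    labels, stack = [], [(root, "")]
    while stack:
        current, prefix = stack.pop()
        end = len(text) if current.end is None else current.end
        prefix += text[current.start:end]
        if not current.children and current is not root:
            labels.append(prefix)
        for child in current.children.values():
            stack.append((child, prefix))
    return sorted(labels)
```

A simple sanity check is to end the input with a unique terminator such as `$` and compare `leaf_labels(build_suffix_tree(text), text)` with the sorted list of all suffixes of `text`; the two should be equal.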

    Finally, my implementation in Python is as follows:

    • Python

    Tip: The code above includes a naive tree-printing function, which is very important when debugging. It saved me a lot of time and makes it easy to locate special cases.
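A naive printer of this kind might look like the sketch below. It is an assumption about what such a helper could be, not the answerer's function: the `Node` layout (edge label `text[start:end]`, `end=None` for an open leaf) and the name `format_tree` are hypothetical.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start   # edge label is text[start:end]
        self.end = end       # None marks an open-ended leaf edge
        self.children = {}   # first character of an edge label -> child Node


def format_tree(node, text, depth=0):
    """Return one indented line per edge, children sorted by first character."""
    lines = []
    for first in sorted(node.children):
        child = node.children[first]
        end = len(text) if child.end is None else child.end
        kind = "node" if child.children else "leaf"
        label = text[child.start:end]
        lines.append("  " * depth + f"{label} [{child.start}:{end}] ({kind})")
        lines.extend(format_tree(child, text, depth + 1))
    return lines


def print_tree(root, text):
    print("\n".join(format_tree(root, text)))
```

Printing the edge indices next to each label makes it easier to spot an edge whose start or end was not updated after a split.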
