I feel a bit thick at this point. I\'ve spent days trying to fully wrap my head around suffix tree construction, but because I don\'t have a mathematical background, many of
@jogojapan you brought awesome explanation and visualisation. But as @makagonov mentioned it's missing some rules regarding setting suffix links. It's visible in nice way when going step by step on http://brenden.github.io/ukkonen-animation/ through word 'aabaaabb'. When you go from step 10 to step 11, there is no suffix link from node 5 to node 2 but active point suddenly moves there.
@makagonov since I live in Java world I also tried to follow your implementation to grasp ST building workflow but it was hard to me because of:
So I ended up with such implementation in Java which I hope reflects all steps in clearer way and will reduce learning time for other Java people:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
public class ST {
public class Node {
private final int id;
private final Map<Character, Edge> edges;
private Node slink;
public Node(final int id) {
this.id = id;
this.edges = new HashMap<>();
}
public void setSlink(final Node slink) {
this.slink = slink;
}
public Map<Character, Edge> getEdges() {
return this.edges;
}
public Node getSlink() {
return this.slink;
}
public String toString(final String word) {
return new StringBuilder()
.append("{")
.append("\"id\"")
.append(":")
.append(this.id)
.append(",")
.append("\"slink\"")
.append(":")
.append(this.slink != null ? this.slink.id : null)
.append(",")
.append("\"edges\"")
.append(":")
.append(edgesToString(word))
.append("}")
.toString();
}
private StringBuilder edgesToString(final String word) {
final StringBuilder edgesStringBuilder = new StringBuilder();
edgesStringBuilder.append("{");
for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
edgesStringBuilder.append("\"")
.append(entry.getKey())
.append("\"")
.append(":")
.append(entry.getValue().toString(word))
.append(",");
}
if(!this.edges.isEmpty()) {
edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
}
edgesStringBuilder.append("}");
return edgesStringBuilder;
}
public boolean contains(final String word, final String suffix) {
return !suffix.isEmpty()
&& this.edges.containsKey(suffix.charAt(0))
&& this.edges.get(suffix.charAt(0)).contains(word, suffix);
}
}
public class Edge {
private final int from;
private final int to;
private final Node next;
public Edge(final int from, final int to, final Node next) {
this.from = from;
this.to = to;
this.next = next;
}
public int getFrom() {
return this.from;
}
public int getTo() {
return this.to;
}
public Node getNext() {
return this.next;
}
public int getLength() {
return this.to - this.from;
}
public String toString(final String word) {
return new StringBuilder()
.append("{")
.append("\"content\"")
.append(":")
.append("\"")
.append(word.substring(this.from, this.to))
.append("\"")
.append(",")
.append("\"next\"")
.append(":")
.append(this.next != null ? this.next.toString(word) : null)
.append("}")
.toString();
}
public boolean contains(final String word, final String suffix) {
if(this.next == null) {
return word.substring(this.from, this.to).equals(suffix);
}
return suffix.startsWith(word.substring(this.from,
this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
}
}
public class ActivePoint {
private final Node activeNode;
private final Character activeEdgeFirstCharacter;
private final int activeLength;
public ActivePoint(final Node activeNode,
final Character activeEdgeFirstCharacter,
final int activeLength) {
this.activeNode = activeNode;
this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
this.activeLength = activeLength;
}
private Edge getActiveEdge() {
return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
}
public boolean pointsToActiveNode() {
return this.activeLength == 0;
}
public boolean activeNodeIs(final Node node) {
return this.activeNode == node;
}
public boolean activeNodeHasEdgeStartingWith(final char character) {
return this.activeNode.getEdges().containsKey(character);
}
public boolean activeNodeHasSlink() {
return this.activeNode.getSlink() != null;
}
public boolean pointsToOnActiveEdge(final String word, final char character) {
return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
}
public boolean pointsToTheEndOfActiveEdge() {
return this.getActiveEdge().getLength() == this.activeLength;
}
public boolean pointsAfterTheEndOfActiveEdge() {
return this.getActiveEdge().getLength() < this.activeLength;
}
public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
return new ActivePoint(this.activeNode, character, 1);
}
public ActivePoint moveToNextNodeOfActiveEdge() {
return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
}
public ActivePoint moveToSlink() {
return new ActivePoint(this.activeNode.getSlink(),
this.activeEdgeFirstCharacter,
this.activeLength);
}
public ActivePoint moveTo(final Node node) {
return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
}
public ActivePoint moveByOneCharacter() {
return new ActivePoint(this.activeNode,
this.activeEdgeFirstCharacter,
this.activeLength + 1);
}
public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
final char character) {
return new ActivePoint(node, character, this.activeLength - 1);
}
public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
return new ActivePoint(this.getActiveEdge().getNext(),
word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
this.activeLength - this.getActiveEdge().getLength());
}
public void addEdgeToActiveNode(final char character, final Edge edge) {
this.activeNode.getEdges().put(character, edge);
}
public void splitActiveEdge(final String word,
final Node nodeToAdd,
final int index,
final char character) {
final Edge activeEdgeToSplit = this.getActiveEdge();
final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
activeEdgeToSplit.getFrom() + this.activeLength,
nodeToAdd);
nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
activeEdgeToSplit.getTo(),
activeEdgeToSplit.getNext()));
nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
}
public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
final Node node) {
if(previouslyAddedNodeOrAddedEdgeNode != null) {
previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
}
return node;
}
public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
}
}
private static int idGenerator;
private final String word;
private final Node root;
private ActivePoint activePoint;
private int remainder;
public ST(final String word) {
this.word = word;
this.root = new Node(idGenerator++);
this.activePoint = new ActivePoint(this.root, null, 0);
this.remainder = 0;
build();
}
private void build() {
for(int i = 0; i < this.word.length(); i++) {
add(i, this.word.charAt(i));
}
}
private void add(final int index, final char character) {
this.remainder++;
boolean characterFoundInTheTree = false;
Node previouslyAddedNodeOrAddedEdgeNode = null;
while(!characterFoundInTheTree && this.remainder > 0) {
if(this.activePoint.pointsToActiveNode()) {
if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
characterFoundInTheTree = true;
}
else {
if(this.activePoint.activeNodeIs(this.root)) {
rootNodeHasNotEdgeStartingWithCharacter(index, character);
}
else {
previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
character, previouslyAddedNodeOrAddedEdgeNode);
}
}
}
else {
if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
activeEdgeHasCharacter();
characterFoundInTheTree = true;
}
else {
if(this.activePoint.activeNodeIs(this.root)) {
previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
character,
previouslyAddedNodeOrAddedEdgeNode);
}
else {
previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
character,
previouslyAddedNodeOrAddedEdgeNode);
}
}
}
}
}
private void activeNodeHasEdgeStartingWithCharacter(final char character,
final Node previouslyAddedNodeOrAddedEdgeNode) {
this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
if(this.activePoint.pointsToTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
this.activePoint = this.activePoint.moveTo(this.root);
this.remainder--;
assert this.remainder == 0;
}
private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
if(this.activePoint.activeNodeHasSlink()) {
this.activePoint = this.activePoint.moveToSlink();
}
else {
this.activePoint = this.activePoint.moveTo(this.root);
}
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private void activeEdgeHasCharacter() {
this.activePoint = this.activePoint.moveByOneCharacter();
if(this.activePoint.pointsToTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
private Node edgeFromRootNodeHasNotCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
final Node newNode = new Node(idGenerator++);
this.activePoint.splitActiveEdge(this.word, newNode, index, character);
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
this.word.charAt(index - this.remainder + 2));
this.activePoint = walkDown(index);
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private Node edgeFromInternalNodeHasNotCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
final Node newNode = new Node(idGenerator++);
this.activePoint.splitActiveEdge(this.word, newNode, index, character);
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
if(this.activePoint.activeNodeHasSlink()) {
this.activePoint = this.activePoint.moveToSlink();
}
else {
this.activePoint = this.activePoint.moveTo(this.root);
}
this.activePoint = walkDown(index);
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private ActivePoint walkDown(final int index) {
while(!this.activePoint.pointsToActiveNode()
&& (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
}
else {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
return this.activePoint;
}
public String toString(final String word) {
return this.root.toString(word);
}
public boolean contains(final String suffix) {
return this.root.contains(this.word, suffix);
}
public static void main(final String[] args) {
final String[] words = {
"abcabcabc$",
"abc$",
"abcabxabcd$",
"abcabxabda$",
"abcabxad$",
"aabaaabb$",
"aababcabcd$",
"ababcabcd$",
"abccba$",
"mississipi$",
"abacabadabacabae$",
"abcabcd$",
"00132220$"
};
Arrays.stream(words).forEach(word -> {
System.out.println("Building suffix tree for word: " + word);
final ST suffixTree = new ST(word);
System.out.println("Suffix tree: " + suffixTree.toString(word));
for(int i = 0; i < word.length() - 1; i++) {
assert suffixTree.contains(word.substring(i)) : word.substring(i);
}
});
}
}
Hi i have tried to implement the above explained implementation in ruby , please check it out. it seems to work fine.
the only difference in the implementation is that , i have tried to use the edge object instead of just using symbols.
its also present at https://gist.github.com/suchitpuri/9304856
require 'pry'
class Edge
attr_accessor :data , :edges , :suffix_link
def initialize data
@data = data
@edges = []
@suffix_link = nil
end
def find_edge element
self.edges.each do |edge|
return edge if edge.data.start_with? element
end
return nil
end
end
class SuffixTrees
attr_accessor :root , :active_point , :remainder , :pending_prefixes , :last_split_edge , :remainder
def initialize
@root = Edge.new nil
@active_point = { active_node: @root , active_edge: nil , active_length: 0}
@remainder = 0
@pending_prefixes = []
@last_split_edge = nil
@remainder = 1
end
def build string
string.split("").each_with_index do |element , index|
add_to_edges @root , element
update_pending_prefix element
add_pending_elements_to_tree element
active_length = @active_point[:active_length]
# if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data[0..active_length-1] == @active_point[:active_edge].data[active_length..@active_point[:active_edge].data.length-1])
# @active_point[:active_edge].data = @active_point[:active_edge].data[0..active_length-1]
# @active_point[:active_edge].edges << Edge.new(@active_point[:active_edge].data)
# end
if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data.length == @active_point[:active_length] )
@active_point[:active_node] = @active_point[:active_edge]
@active_point[:active_edge] = @active_point[:active_node].find_edge(element[0])
@active_point[:active_length] = 0
end
end
end
def add_pending_elements_to_tree element
to_be_deleted = []
update_active_length = false
# binding.pry
if( @active_point[:active_node].find_edge(element[0]) != nil)
@active_point[:active_length] = @active_point[:active_length] + 1
@active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
@remainder = @remainder + 1
return
end
@pending_prefixes.each_with_index do |pending_prefix , index|
# binding.pry
if @active_point[:active_edge] == nil and @active_point[:active_node].find_edge(element[0]) == nil
@active_point[:active_node].edges << Edge.new(element)
else
@active_point[:active_edge] = node.find_edge(element[0]) if @active_point[:active_edge] == nil
data = @active_point[:active_edge].data
data = data.split("")
location = @active_point[:active_length]
# binding.pry
if(data[0..location].join == pending_prefix or @active_point[:active_node].find_edge(element) != nil )
else #tree split
split_edge data , index , element
end
end
end
end
def update_pending_prefix element
if @active_point[:active_edge] == nil
@pending_prefixes = [element]
return
end
@pending_prefixes = []
length = @active_point[:active_edge].data.length
data = @active_point[:active_edge].data
@remainder.times do |ctr|
@pending_prefixes << data[-(ctr+1)..data.length-1]
end
@pending_prefixes.reverse!
end
def split_edge data , index , element
location = @active_point[:active_length]
old_edges = []
internal_node = (@active_point[:active_edge].edges != nil)
if (internal_node)
old_edges = @active_point[:active_edge].edges
@active_point[:active_edge].edges = []
end
@active_point[:active_edge].data = data[0..location-1].join
@active_point[:active_edge].edges << Edge.new(data[location..data.size].join)
if internal_node
@active_point[:active_edge].edges << Edge.new(element)
else
@active_point[:active_edge].edges << Edge.new(data.last)
end
if internal_node
@active_point[:active_edge].edges[0].edges = old_edges
end
#setup the suffix link
if @last_split_edge != nil and @last_split_edge.data.end_with?@active_point[:active_edge].data
@last_split_edge.suffix_link = @active_point[:active_edge]
end
@last_split_edge = @active_point[:active_edge]
update_active_point index
end
def update_active_point index
if(@active_point[:active_node] == @root)
@active_point[:active_length] = @active_point[:active_length] - 1
@remainder = @remainder - 1
@active_point[:active_edge] = @active_point[:active_node].find_edge(@pending_prefixes.first[index+1])
else
if @active_point[:active_node].suffix_link != nil
@active_point[:active_node] = @active_point[:active_node].suffix_link
else
@active_point[:active_node] = @root
end
@active_point[:active_edge] = @active_point[:active_node].find_edge(@active_point[:active_edge].data[0])
@remainder = @remainder - 1
end
end
def add_to_edges root , element
return if root == nil
root.data = root.data + element if(root.data and root.edges.size == 0)
root.edges.each do |edge|
add_to_edges edge , element
end
end
end
suffix_tree = SuffixTrees.new
suffix_tree.build("abcabxabcd")
binding.pry
Thanks for the well explained tutorial by @jogojapan, I implemented the algorithm in Python.
A couple of minor problems mentioned by @jogojapan turns out to be more sophisticated than I have expected, and need to be treated very carefully. It cost me several days to get my implementation robust enough (I suppose). Problems and solutions are listed below:
End with Remainder > 0
It turns out this situation can also happen during the unfolding step, not just the end of the entire algorithm. When that happens, we can leave the remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start another step by either keep folding or unfolding depending on if the next char in the original string is on the current path or not.
Leap Over Nodes: When we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. We have to move forward to the right place to split, or insert a leaf. This process might be not that straightforward because during the moving the actlength and actedge keep changing all the way, when you have to move back to the root node, the actedge and actlength could be wrong because of those moves. We need additional variable(s) to keep that information.
The other two problems have somehow been pointed out by @managonov
Split Could Degenerate When trying to split an edge, sometime you'll find the split operation is right on a node. That case we only need add a new leaf to that node, take it as a standard edge split operation, which means the suffix links if there's any, should be maintained correspondingly.
Hidden Suffix Links There is another special case which is incurred by problem 1 and problem 2. Sometimes we need to hop over several nodes to the right point for split, we might surpass the right point if we move by comparing the remainder string and the path labels. That case the suffix link will be neglected unintentionally, if there should be any. This could be avoided by remembering the right point when moving forward. The suffix link should be maintained if the split node already exists, or even the problem 1 happens during a unfolding step.
Finally, my implementation in Python is as follows:
Tips: It includes a naive tree printing function in the code above, which is very important while debugging. It saved me a lot of time and is convenient for locating special cases.
The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.
First, a few preliminary statements.
What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth
But: Unlike in a search trie, the edge labels are not single
characters. Instead, each edge is labeled using a pair of integers
[from,to]
. These are pointers into the text. In this sense, each
edge carries a string label of arbitrary length, but takes only O(1)
space (two pointers).
I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:
abc
The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).
So, we start from the left, and first insert only the single character
a
by creating an edge from the root node (on the left) to a leaf,
and labeling it as [0,#]
, which means the edge represents the
substring starting at position 0 and ending at the current end. I
use the symbol #
to mean the current end, which is at position 1
(right after a
).
So we have an initial tree, which looks like this:
And what it means is this:
Now we progress to position 2 (right after b
). Our goal at each step
is to insert all suffixes up to the current position. We do this
by
a
-edge to ab
b
In our representation this looks like
And what it means is:
We observe two things:
ab
is the same as it used to be
in the initial tree: [0,#]
. Its meaning has automatically changed
because we updated the current position #
from 1 to 2.Next we increment the position again and update the tree by appending
a c
to every existing edge and inserting one new edge for the new
suffix c
.
In our representation this looks like
And what it means is:
We observe:
#
, and inserting the
one new edge for the final character can be done in O(1)
time. Hence for a string of length n, only O(n) time is required.Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:
abcabxabcd
It starts with abc
as in the previous example, then ab
is repeated
and followed by x
, and then abc
is repeated followed by d
.
Steps 1 through 3: After the first 3 steps we have the tree from the previous example:
Step 4: We move #
to position 4. This implicitly updates all existing
edges to this:
and we need to insert the final suffix of the current step, a
, at
the root.
Before we do this, we introduce two more variables (in addition to
#
), which of course have been there all the time but we haven't used
them so far:
(active_node,active_edge,active_length)
remainder
, which is an integer indicating how many new suffixes
we need to insertThe exact meaning of these two will become clear soon, but for now let's just say:
abc
example, the active point was always
(root,'\0x',0)
, i.e. active_node
was the root node, active_edge
was specified as the null character '\0x'
, and active_length
was zero. The effect of this was that the one new edge that
we inserted in every step was inserted at the root node as a
freshly created edge. We will see soon why a triple is necessary to
represent this information.remainder
was always set to 1 at the beginning of each
step. The meaning of this was that the number of suffixes we had to
actively insert at the end of each step was 1 (always just the
final character).Now this is going to change. When we insert the current final
character a
at the root, we notice that there is already an outgoing
edge starting with a
, specifically: abca
. Here is what we do in
such a case:
[4,#]
at the root node. Instead we
simply notice that the suffix a
is already in our
tree. It ends in the middle of a longer edge, but we are not
bothered by that. We just leave things the way they are.(root,'a',1)
. That means the active
point is now somewhere in the middle of outgoing edge of the root node that starts with a
, specifically, after position 1 on that edge. We
notice that the edge is specified simply by its first
character a
. That suffices because there can be only one edge
starting with any particular character (confirm that this is true after reading through the entire description).remainder
, so at the beginning of the next step
it will be 2.Observation: When the final suffix we need to insert is found to
exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder
). The tree
is then not an accurate representation of the suffix tree up to the
current position any more, but it contains all suffixes (because the final
suffix a
is contained implicitly). Hence, apart from updating the
variables (which are all of fixed length, so this is O(1)), there was
no work done in this step.
Step 5: We update the current position #
to 5. This
automatically updates the tree to this:
And because remainder
is 2, we need to insert two final
suffixes of the current position: ab
and b
. This is basically because:
a
suffix from the previous step has never been properly
inserted. So it has remained, and since we have progressed one
step, it has now grown from a
to ab
.b
.In practice this means that we go to the active point (which points to
behind the a
on what is now the abcab
edge), and insert the
current final character b
. But: Again, it turns out that b
is
also already present on that same edge.
So, again, we do not change the tree. We simply:
(root,'a',2)
(same node and edge
as before, but now we point to behind the b
)remainder
to 3 because we still have not properly
inserted the final edge from the previous step, and we don't insert
the current final edge either.To be clear: We had to insert ab
and b
in the current step, but
because ab
was already found, we updated the active point and did
not even attempt to insert b
. Why? Because if ab
is in the tree,
every suffix of it (including b
) must be in the tree,
too. Perhaps only implicitly, but it must be there, because of the
way we have built the tree so far.
We proceed to step 6 by incrementing #
. The tree is
automatically updated to:
Because remainder
is 3, we have to insert abx
, bx
and
x
. The active point tells us where ab
ends, so we only need to
jump there and insert the x
. Indeed, x
is not there yet, so we
split the abcabx
edge and insert an internal node:
The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.
So we have dealt with abx
and decrement remainder
to 2. Now we
need to insert the next remaining suffix, bx
. But before we do that
we need to update the active point. The rule for this, after splitting
and inserting an edge, will be called Rule 1 below, and it applies whenever the
active_node
is root (we will learn rule 3 for other cases further
below). Here is rule 1:
After an insertion from root,
active_node
remains rootactive_edge
is set to the first character of the new suffix we need to insert, i.e.b
active_length
is reduced by 1
Hence, the new active-point triple (root,'b',1)
indicates that the
next insert has to be made at the bcabx
edge, behind 1 character,
i.e. behind b
. We can identify the insertion point in O(1) time and
check whether x
is already present or not. If it was present, we
would end the current step and leave everything the way it is. But x
is not present, so we insert it by splitting the edge:
Again, this took O(1) time and we update remainder
to 1 and the
active point to (root,'x',0)
as rule 1 states.
But there is one more thing we need to do. We'll call this Rule 2:
If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:
We still need to insert the final suffix of the current step,
x
. Since the active_length
component of the active node has fallen
to 0, the final insert is made at the root directly. Since there is no
outgoing edge at the root node starting with x
, we insert a new
edge:
As we can see, in the current step all remaining inserts were made.
We proceed to step 7 by setting #
=7, which automatically appends the next character,
a
, to all leaf edges, as always. Then we attempt to insert the new final
character to the active point (the root), and find that it is there
already. So we end the current step without inserting anything and
update the active point to (root,'a',1)
.
In step 8, #
=8, we append b
, and as seen before, this only
means we update the active point to (root,'a',2)
and increment remainder
without doing
anything else, because b
is already present. However, we notice (in O(1) time) that the active point
is now at the end of an edge. We reflect this by re-setting it to
(node1,'\0x',0)
. Here, I use node1
to refer to the
internal node the ab
edge ends at.
Then, in step #
=9, we need to insert 'c' and this will help us to
understand the final trick:
As always, the #
update appends c
automatically to the leaf edges
and we go to the active point to see if we can insert 'c'. It turns
out 'c' exists already at that edge, so we set the active point to
(node1,'c',1)
, increment remainder
and do nothing else.
Now in step #
=10, remainder
is 4, and so we first need to insert
abcd
(which remains from 3 steps ago) by inserting d
at the active
point.
Attempting to insert d
at the active point causes an edge split in
O(1) time:
The active_node
, from which the split was initiated, is marked in
red above. Here is the final rule, Rule 3:
After splitting an edge from an
active_node
that is not the root node, we follow the suffix link going out of that node, if there is any, and reset theactive_node
to the node it points to. If there is no suffix link, we set theactive_node
to the root.active_edge
andactive_length
remain unchanged.
So the active point is now (node2,'c',1)
, and node2
is marked in
red below:
Since the insertion of abcd
is complete, we decrement remainder
to
3 and consider the next remaining suffix of the current step,
bcd
. Rule 3 has set the active point to just the right node and edge
so inserting bcd
can be done by simply inserting its final character
d
at the active point.
Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:
We observe: Suffix links enable us to reset the active point so we
can make the next remaining insert at O(1) effort. Look at the
graph above to confirm that indeed node at label ab
is linked to
the node at b
(its suffix), and the node at abc
is linked to
bc
.
The current step is not finished yet. remainder
is now 2, and we
need to follow rule 3 to reset the active point again. Since the
current active_node
(red above) has no suffix link, we reset to
root. The active point is now (root,'c',1)
.
Hence the next insert occurs at the one outgoing edge of the root node
whose label starts with c
: cabxabcd
, behind the first character,
i.e. behind c
. This causes another split:
And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:
(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)
With this, remainder
can be set to 1 and since the active_node
is
root, we use rule 1 to update the active point to (root,'d',0)
. This
means the final insert of the current step is to insert a single d
at root:
That was the final step and we are done. There are number of final observations, though:
In each step we move #
forward by 1 position. This automatically
updates all leaf nodes in O(1) time.
But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.
remainder
tells us how many additional inserts we need to
make. These inserts correspond one-to-one to the final suffixes of
the string that ends at the current position #
. We consider one
after the other and make the insert. Important: Each insert is
done in O(1) time since the active point tells us exactly where to
go, and we need to add only one single character at the active
point. Why? Because the other characters are contained implicitly
(otherwise the active point would not be where it is).
After each such insert, we decrement remainder
and follow the
suffix link if there is any. If not we go to root (rule 3). If we
are at root already, we modify the active point using rule 1. In
any case, it takes only O(1) time.
If, during one of these inserts, we find that the character we want
to insert is already there, we don't do anything and end the
current step, even if remainder
>0. The reason is that any
inserts that remain will be suffixes of the one we just tried to
make. Hence they are all implicit in the current tree. The fact
that remainder
>0 makes sure we deal with the remaining suffixes
later.
What if at the end of the algorithm remainder
>0? This will be the
case whenever the end of the text is a substring that occurred
somewhere before. In that case we must append one extra character
at the end of the string that has not occurred before. In the
literature, usually the dollar sign $
is used as a symbol for
that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder
to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below.
So what is the complexity of the entire algorithm? If the text is n
characters in length, there are obviously n steps (or n+1 if we add
the dollar sign). In each step we either do nothing (other than
updating the variables), or we make remainder
inserts, each taking O(1)
time. Since remainder
indicates how many times we have done nothing
in previous steps, and is decremented for every insert that we make
now, the total number of times we do something is exactly n (or
n+1). Hence, the total complexity is O(n).
However, there is one small thing that I did not properly explain:
It can happen that we follow a suffix link, update the active
point, and then find that its active_length
component does not
work well with the new active_node
. For example, consider a situation
like this:
(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)
Now let the active point be (red,'d',3)
, so it points to the place
behind the f
on the defg
edge. Now assume we made the necessary
updates and now follow the suffix link to update the active point
according to rule 3. The new active point is (green,'d',3)
. However,
the d
-edge going out of the green node is de
, so it has only 2
characters. In order to find the correct active point, we obviously
need to follow that edge to the blue node and reset to (blue,'f',1)
.
In a particularly bad case, the active_length
could be as large as
remainder
, which can be as large as n. And it might very well happen
that to find the correct active point, we need not only jump over one
internal node, but perhaps many, up to n in the worst case. Does that
mean the algorithm has a hidden O(n2) complexity, because
in each step remainder
is generally O(n), and the post-adjustments
to the active node after following a suffix link could be O(n), too?
No. The reason is that if indeed we have to adjust the active point
(e.g. from green to blue as above), that brings us to a new node that
has its own suffix link, and active_length
will be reduced. As
we follow down the chain of suffix links we make the remaining inserts, active_length
can only
decrease, and the number of active-point adjustments we can make on
the way can't be larger than active_length
at any given time. Since
active_length
can never be larger than remainder
, and remainder
is O(n) not only in every single step, but the total sum of increments
ever made to remainder
over the course of the entire process is
O(n) too, the number of active point adjustments is also bounded by
O(n).
My intuition is as follows:
After k iterations of the main loop you have constructed a suffix tree which contains all suffixes of the complete string that start in the first k characters.
At the start, this means the suffix tree contains a single root node that represents the entire string (this is the only suffix that starts at 0).
After len(string) iterations you have a suffix tree that contains all suffixes.
During the loop the key is the active point. My guess is that this represents the deepest point in the suffix tree that corresponds to a proper suffix of the first k characters of the string. (I think proper means that the suffix cannot be the entire string.)
For example, suppose you have seen characters 'abcabc'. The active point would represent the point in the tree corresponding to the suffix 'abc'.
The active point is represented by (origin,first,last). This means that you are currently at the point in the tree that you get to by starting at node origin and then feeding in the characters in string[first:last]
When you add a new character you look to see whether the active point is still in the existing tree. If it is then you are done. Otherwise you need to add a new node to the suffix tree at the active point, fallback to the next shortest match, and check again.
Note 1: The suffix pointers give a link to the next shortest match for each node.
Note 2: When you add a new node and fallback you add a new suffix pointer for the new node. The destination for this suffix pointer will be the node at the shortened active point. This node will either already exist, or be created on the next iteration of this fallback loop.
Note 3: The canonization part simply saves time in checking the active point. For example, suppose you always used origin=0, and just changed first and last. To check the active point you would have to follow the suffix tree each time along all the intermediate nodes. It makes sense to cache the result of following this path by recording just the distance from the last node.
Can you give a code example of what you mean by "fix" bounding variables?
Health warning: I also found this algorithm particularly hard to understand so please realise that this intuition is likely to be incorrect in all important details...
I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to wording used for the rules. Moreover, I've mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.
Additional variables used
Let's use a concept of an internal node - all the nodes, except the root and the leafs are internal nodes.
Observation 1
When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point
and remainder
).
Observation 2
If at some point active_length
is greater or equal to the length of current edge (edge_length
), we move our active point
down until edge_length
is strictly greater than active_length
.
Now, let's redefine the rules:
Rule 1
If after an insertion from the active node = root, the active length is greater than 0, then:
- active node is not changed
- active length is decremented
- active edge is shifted right (to the first character of the next suffix we must insert)
Rule 2
If we create a new internal node OR make an inserter from an internal node, and this is not the first SUCH internal node at current step, then we link the previous SUCH node with THIS one through a suffix link.
This definition of the Rule 2
is different from jogojapan', as here we take into account not only the newly created internal nodes, but also the internal nodes, from which we make an insertion.
Rule 3
After an insert from the active node which is not the root node, we must follow the suffix link and set the active node to the node it points to. If there is no a suffix link, set the active node to the root node. Either way, active edge and active length stay unchanged.
In this definition of Rule 3
we also consider the inserts of leaf nodes (not only split-nodes).
And finally, Observation 3:
When the symbol we want to add to the tree is already on the edge, we, according to Observation 1
, update only active point
and remainder
, leaving the tree unchanged. BUT if there is an internal node marked as needing suffix link, we must connect that node with our current active node
through a suffix link.
Let's look at the example of a suffix tree for cffffdcdc if we add a suffix link in such case and if we don't:
If we DON'T connect the nodes through a suffix link:
If we DO connect the nodes through a suffix link:
Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3
, because, according to it, if there's no a suffix link, then we must put the active_node
to the root.
When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node
was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.
Finally, here are my implementations of the Suffix Tree:
Hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement his own Suffix Tree.