I have a very long string. I want to find all the unique substrings of this string. I tried to write the code where I used a set(python) to store all the substrings
Here is some Python code based on a O(n) suffix tree construction to produced the unique substrings from a collection of input strings (the output should appear in sorted order so there is no need to sort the strings afterwards).
As there can be O(n^2) output strings it may take a long time to actually output all the strings.
from collections import defaultdict
class SuffixTree:
def __init__(self):
"""Returns an empty suffix tree"""
self.T=''
self.E={}
self.nodes=[-1]
def add(self,s):
"""Adds the input string to the suffix tree.
This inserts all substrings into the tree.
End the string with a unique character if you want a leaf-node for every suffix.
Produces an edge graph keyed by (node,character) that gives (first,last,end)
This means that the edge has characters from T[first:last+1] and goes to node end."""
origin,first,last = 0,len(self.T),len(self.T)-1
self.T+=s
nc = len(self.nodes)
self.nodes += [-1]*(2*len(s))
T=self.T
E=self.E
nodes=self.nodes
Lm1=len(T)-1
for last_char_index in xrange(first,len(T)):
c=T[last_char_index]
last_parent_node = -1
while 1:
parent_node = origin
if first>last:
if (origin,c) in E:
break
else:
key = origin,T[first]
edge_first, edge_last, edge_end = E[key]
span = last - first
A = edge_first+span
m = T[A+1]
if m==c:
break
E[key] = (edge_first, A, nc)
nodes[nc] = origin
E[nc,m] = (A+1,edge_last,edge_end)
parent_node = nc
nc+=1
E[parent_node,c] = (last_char_index, Lm1, nc)
nc+=1
if last_parent_node>0:
nodes[last_parent_node] = parent_node
last_parent_node = parent_node
if origin==0:
first+=1
else:
origin = nodes[origin]
if first <= last:
edge_first,edge_last,edge_end=E[origin,T[first]]
span = edge_last-edge_first
while span <= last - first:
first+=span+1
origin = edge_end
if first <= last:
edge_first,edge_last,edge_end = E[origin,T[first]]
span = edge_last - edge_first
if last_parent_node>0:
nodes[last_parent_node] = parent_node
last+=1
if first <= last:
edge_first,edge_last,edge_end=E[origin,T[first]]
span = edge_last-edge_first
while span <= last - first:
first+=span+1
origin = edge_end
if first <= last:
edge_first,edge_last,edge_end = E[origin,T[first]]
span = edge_last - edge_first
return self
def make_choices(self):
"""Construct a sorted list for each node of the possible continuing characters"""
choices = self.choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
for (origin,c),edge in self.E.items():
choices[origin].append(c)
choices=[sorted(s) for s in choices] # should not have any repeats by construction
return choices
def find_substrings(self,A,term):
"""Recurses through the tree appending unique substrings into A.
Strings assumed to use term as the terminating character"""
choices = self.make_choices()
def f(node,depth):
t=0
for c in choices[node]:
if c==term: continue
first,last,end = self.E[node,c]
# All end points along this edge result in new unique substrings
edge_len = last-first+1
a = first-depth
for b in range(first,last+1):
if self.T[b]!=term:
A.append( self.T[a:b+1] )
f(end,depth+edge_len)
return t
return f(0,0)
def fast_find_all_substrings(strings):
S = SuffixTree()
term = '\0'
for string in strings:
S.add(string+term)
A=[]
S.find_substrings(A,term)
return A
A="abc","abcd","bca"
print fast_find_all_substrings(A)
If you really need it in memory, then you can try making a suffix tree. Tries are not exotic data structures, so there are probably good implementations available for a mainstream language like Python, and they can be used to implement suffix trees. Marisa-Trie is supposed to get good memory usage.