How to find all the unique substrings of a very long string?

后端 未结 2 853
无人及你
无人及你 2020-12-21 16:13

I have a very long string. I want to find all the unique substrings of this string. I tried to write the code where I used a set(python) to store all the substrings

相关标签:
2条回答
  • 2020-12-21 16:54

    Here is some Python code based on a O(n) suffix tree construction to produced the unique substrings from a collection of input strings (the output should appear in sorted order so there is no need to sort the strings afterwards).

    As there can be O(n^2) output strings it may take a long time to actually output all the strings.

    from collections import defaultdict
    
    class SuffixTree:
        def __init__(self):
            """Returns an empty suffix tree"""
            self.T=''
            self.E={}
            self.nodes=[-1]
    
        def add(self,s):
            """Adds the input string to the suffix tree.
    
            This inserts all substrings into the tree.
            End the string with a unique character if you want a leaf-node for every suffix.
    
            Produces an edge graph keyed by (node,character) that gives (first,last,end)
            This means that the edge has characters from T[first:last+1] and goes to node end."""
            origin,first,last = 0,len(self.T),len(self.T)-1
            self.T+=s
            nc = len(self.nodes)
            self.nodes += [-1]*(2*len(s))
            T=self.T
            E=self.E
            nodes=self.nodes
    
            Lm1=len(T)-1
            for last_char_index in xrange(first,len(T)):
                c=T[last_char_index]
                last_parent_node = -1                    
                while 1:
                    parent_node = origin
                    if first>last:
                        if (origin,c) in E:
                            break             
                    else:
                        key = origin,T[first]
                        edge_first, edge_last, edge_end = E[key]
                        span = last - first
                        A = edge_first+span
                        m = T[A+1]
                        if m==c:
                            break
                        E[key] = (edge_first, A, nc)
                        nodes[nc] = origin
                        E[nc,m] = (A+1,edge_last,edge_end)
                        parent_node = nc
                        nc+=1  
                    E[parent_node,c] = (last_char_index, Lm1, nc)
                    nc+=1  
                    if last_parent_node>0:
                        nodes[last_parent_node] = parent_node
                    last_parent_node = parent_node
                    if origin==0:
                        first+=1
                    else:
                        origin = nodes[origin]
    
                    if first <= last:
                        edge_first,edge_last,edge_end=E[origin,T[first]]
                        span = edge_last-edge_first
                        while span <= last - first:
                            first+=span+1
                            origin = edge_end
                            if first <= last:
                                edge_first,edge_last,edge_end = E[origin,T[first]]
                                span = edge_last - edge_first
    
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last+=1
                if first <= last:
                        edge_first,edge_last,edge_end=E[origin,T[first]]
                        span = edge_last-edge_first
                        while span <= last - first:
                            first+=span+1
                            origin = edge_end
                            if first <= last:
                                edge_first,edge_last,edge_end = E[origin,T[first]]
                                span = edge_last - edge_first
            return self
    
        def make_choices(self):
            """Construct a sorted list for each node of the possible continuing characters"""
            choices = self.choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
            for (origin,c),edge in self.E.items():
                choices[origin].append(c)
            choices=[sorted(s) for s in choices] # should not have any repeats by construction
            return choices
    
        def find_substrings(self,A,term):
            """Recurses through the tree appending unique substrings into A.
            Strings assumed to use term as the terminating character"""
            choices = self.make_choices()
            def f(node,depth):
                t=0
                for c in choices[node]:
                    if c==term: continue
                    first,last,end = self.E[node,c]
                    # All end points along this edge result in new unique substrings
                    edge_len = last-first+1
                    a = first-depth
                    for b in range(first,last+1):
                        if self.T[b]!=term:
                            A.append( self.T[a:b+1] )
                    f(end,depth+edge_len)
                return t
            return f(0,0)
    
    def fast_find_all_substrings(strings):
        S = SuffixTree()
        term = '\0'
        for string in strings:
            S.add(string+term)
        A=[]
        S.find_substrings(A,term)
        return A
    
    A="abc","abcd","bca"
    print fast_find_all_substrings(A)
    
    0 讨论(0)
  • 2020-12-21 17:04

    If you really need it in memory, then you can try making a suffix tree. Tries are not exotic data structures, so there are probably good implementations available for a mainstream language like Python, and they can be used to implement suffix trees. Marisa-Trie is supposed to get good memory usage.

    1. Create an empty trie.
    2. For each n in [0, len(s)], add the suffix of length n to the Trie.
    3. Every path from the root of the trie is a substring in the string, there are no such paths that are not substrings in the string, and paths are unique.
    0 讨论(0)
提交回复
热议问题