Longest repeated (k times) substring

清酒与你 2021-01-04 07:27

I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.

This is for the Rosalind project problem L

1 answer
  • 2021-01-04 07:37

    Nice question for an exercise in basic string operations. I didn't remember the suffix tree anymore ;) But as you have stated: theory-wise, you are set.

    How do I preprocess the tree with descendant leaves?

    The Wikipedia stub on this topic is a bit confusing. You only need to know whether you are the outermost non-leaf node with n >= k children. If you have found the substring from the root node to this node in the whole string, the suffix tree tells you that there are n possible continuations. So there must be n places where this string occurs.
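
    To make the occurrence count concrete, here is a minimal sketch that counts the leaves below each node. It assumes the tree is given as a plain dict mapping a node id to its child ids; the names `children` and `count_leaves` are made up for this illustration. In a suffix tree built over a string with a unique terminator, every leaf stands for one suffix, so the number of leaves below a node is the number of places where that node's string occurs. For a node whose children are all leaves, this is simply its number of children.

```python
def count_leaves(children, root):
    """Return {node: number of leaves in the subtree of node}."""
    leaves = {}
    # Iterative post-order walk so deep trees do not hit the recursion limit.
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaves[node] = 1  # a leaf counts itself
        elif visited:
            leaves[node] = sum(leaves[c] for c in kids)
        else:
            stack.append((node, True))  # come back after the children
            stack.extend((c, False) for c in kids)
    return leaves


# Hand-built suffix tree of "abab$" (illustrative node ids):
# node 2 spells "ab", node 5 spells "b", node 8 is the "$" leaf.
children = {1: [2, 5, 8], 2: [3, 4], 5: [6, 7]}
leaf_counts = count_leaves(children, 1)
print(leaf_counts[2], leaf_counts[5], leaf_counts[1])  # 2 2 5
```

    Here "ab" and "b" each occur twice in "abab", which matches the two leaves below nodes 2 and 5.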

    After that, how can I quickly compute depth?

    A simple key concept of this and many similar problems is a depth-first search: in every node, ask the child nodes for their value and return the maximum to the parent. The root node will get the final result.
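
    As a small illustration of that pattern, the sketch below computes the length of the longest root-to-leaf path, measured in characters on the edge labels. The layout (`edges` as a dict from a node id to a list of `(child, label)` pairs) and the name `max_depth` are assumptions for this example only.

```python
def max_depth(edges, node):
    """Length of the longest label path from node down to a leaf."""
    best = 0  # a leaf has nothing below it
    for child, label in edges.get(node, []):
        # Ask the child for its value, add the edge leading to it,
        # and keep the maximum for the parent.
        best = max(best, len(label) + max_depth(edges, child))
    return best


# Hand-built suffix tree of "abab$" (illustrative node ids).
edges = {
    1: [(2, "ab"), (5, "b"), (8, "$")],
    2: [(3, "ab$"), (4, "$")],
    5: [(6, "ab$"), (7, "$")],
}
print(max_depth(edges, 1))  # 5, the whole string "abab$"
```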

    How the values are calculated differs between problems. Here you have three possibilities for every node:

    1. The node has no children. It is a leaf node, so the result is invalid.
    2. Every child returns an invalid result. This is the last non-leaf node, so the result is zero (no more characters after this node). If this node has n children, the concatenated string of every edge from the root to this node appears n times in the whole string. If we need at least k occurrences and k > n, the result is also invalid.
    3. One or more children return something valid. The result is the maximum over those children of the returned value plus the length of the string attached to the edge leading to that child.

    Of course, you also have to return the corresponding node. Otherwise you will know how long the longest repeated substring is, but not where it is.
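
    The three cases can be sketched in one recursive function. Note that this is a variant and not the class-based code in this answer: it counts the leaves below a node rather than its direct children, so it also covers repeats whose occurrences are spread over deeper branches, and it hands back the string itself instead of a node object. The dict layout of `edges` and the name `deepest_repeat` are assumptions made for this sketch.

```python
def deepest_repeat(edges, node, k, prefix=""):
    """Return (leaf_count, best) for the subtree of node.

    best is the longest root-to-node string with at least k leaves
    below it, or None if no node in this subtree qualifies.
    """
    kids = edges.get(node, [])
    if not kids:
        return 1, None  # case 1: a leaf, nothing valid here
    total, best = 0, None
    for child, label in kids:
        count, candidate = deepest_repeat(edges, child, k, prefix + label)
        total += count
        # case 3: keep the longest valid answer reported by a child
        if candidate is not None and (best is None or len(candidate) > len(best)):
            best = candidate
    if best is None and total >= k:
        best = prefix  # case 2: no child is valid, but this node is
    return total, best


# Hand-built suffix tree of "abab$" (illustrative node ids).
edges = {
    1: [(2, "ab"), (5, "b"), (8, "$")],
    2: [(3, "ab$"), (4, "$")],
    5: [(6, "ab$"), (7, "$")],
}
print(deepest_repeat(edges, 1, 2))  # (5, 'ab')
```

    With k = 3 the same tree yields the empty string, because no non-empty substring of "abab" occurs three times.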

    Code

    You should try to code this yourself first. Constructing the tree is simple but not trivial if you want to gather all the necessary information. Nevertheless, here is a simple example. Please note: all sanity checking has been left out and everything will fail horribly if the input is somehow invalid. E.g. do not use any root index other than one, do not refer to a node as a parent before it has been introduced as a child, etc. Much room for improvement *hint;)*.

    import sys

    class Node(object):
        def __init__(self, idx):
            self.idx = idx     # not needed but nice for prints 
            self.parent = None # edge to parent or None
            self.childs = []   # list of edges
    
        def get_deepest(self, k = 2):
            max_value = -1
            max_node = None
            for edge in self.childs:
                r = edge.n2.get_deepest(k)  # pass k down, otherwise the default of 2 is used
                if r is None: continue # leaf
                value, node = r
                value += len(edge.s)
                if value > max_value: # new best result
                    max_value = value
                    max_node = node
            if max_node is None:
                # we are either a leaf (no edge connected) or 
                # the last non-leaf.
                # The number of childs has to be at least k to be valid.
                return (0, self) if len(self.childs) >= k else None
            else:
                return (max_value, max_node)
    
        def get_string_to_root(self):
            if self.parent is None: return "" 
            return self.parent.n1.get_string_to_root() + self.parent.s
    
    class Edge(object):
        # creating the edge also sets the corresponding
        # values in the nodes
        def __init__(self, n1, n2, s):
            # print("Edge %d -> %d [ %s]" % (n1.idx, n2.idx, s))
            self.n1, self.n2, self.s = n1, n2, s
            n1.childs.append(self)
            n2.parent = self
    
    nodes = {1 : Node(1)} # root-node
    string = sys.stdin.readline().strip()  # drop the trailing newline
    k = int(sys.stdin.readline())
    for line in sys.stdin:
        parent_idx, child_idx, start, length = [int(x) for x in line.split()]
        s = string[start-1:start-1+length]
        # every edge constructs a Node
        nodes[child_idx] = Node(child_idx)
        Edge(nodes[parent_idx], nodes[child_idx], s)
    
    (depth, node) = nodes[1].get_deepest(k)
    print(node.get_string_to_root())
    