Find simplest regular expression matching all given strings

后端 未结 2 1790
不思量自难忘°
不思量自难忘° 2021-01-11 20:40

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings th

相关标签:
2条回答
  • 2021-01-11 21:15

    There is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)

    The procedure is simple, if you have access to standard FSM algorithms:

    1. Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.

    2. Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.

    3. Minimize the DFA.

    DFA minimization is a well-studied problem. The Hopcroft algorithm is worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.

    For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.

    0 讨论(0)
  • 2021-01-11 21:18

    You can try to use Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Your input strings as example:

    h_q1_a
    h_q1_b
    h_q1_c
    h_p2_a
    h_p2_b
    h_p2_c
    

    will generate a finite machine that most probably look like this:

          [h_]         <-level 0
         /   \
      [q1]  [p2]       <-level 1
         \   /
          [_]          <-level 2
          /\  \
         /  \  \
        a    b  c      <-level 3
    

    Now for every level/depth of the trie all the stings (if multiple) will go under OR brackets, so

    h_(q1|p2)_(a|b|c)
    L0   L1  L2  L3
    
    0 讨论(0)
提交回复
热议问题