Algorithm to generate a context-free grammar from any regex

Can anyone outline for me an algorithm that can convert any given regex into an equivalent set of CFG rules?

I know how to tackle the elementary stuff such as (a|b)*:

S -> a A
S -> a B
S -> b A
S -> b B
A -> a A
A -> a B
A -> epsilon
B -> b A
B -> b B
B -> epsilon
S -> epsilon (end of string)

However, I'm having trouble formalizing this into a proper algorithm, especially for more complex expressions with many nested operations.

If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:

ab       # concatenation
a|b      # alternation
a*       # repetition or Kleene closure

You could then simply do the following:

Create a rule S -> (fullRegex).
For every repeated term (x)* in fullRegex, create the rules X -> x X and X -> ε, then replace (x)* with X.
For every alternation (a|b|c), create the rules Y -> a, Y -> b and Y -> c, then replace (a|b|c) with Y.

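Repeat this recursively, since x, a, b and c can themselves still be complex regular expressions. Concatenation needs no rule of its own, because writing symbols next to each other on a right-hand side already expresses it. You do have to introduce a fresh, unique nonterminal name at every step.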

This should be enough. It will certainly not give the most elegant or efficient grammar, but that is what normalization is for. Normalization should be done as a separate step, and there are well-defined procedures for it.
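The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the names `regex_to_cfg` and `format_rules` are made up for this example, and it assumes single-character literals, only the operators `|`, `*` and parentheses, no escape sequences, and that the regex does not use the start symbol's letter `S` as a literal.

```python
from itertools import count


def regex_to_cfg(regex):
    """Return a list of (lhs, rhs) rules; rhs is a list of symbols, [] means ε."""
    pos = 0
    rules = []
    star_ids, alt_ids = count(1), count(1)  # unique names X1, X2, ... and Y1, Y2, ...

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_alt():
        # alt := concat ('|' concat)*
        nonlocal pos
        branches = [parse_concat()]
        while peek() == "|":
            pos += 1
            branches.append(parse_concat())
        if len(branches) == 1:
            return branches[0]
        name = f"Y{next(alt_ids)}"
        for branch in branches:  # Y -> a, Y -> b, Y -> c
            rules.append((name, branch))
        return [name]

    def parse_concat():
        # Concatenation is just juxtaposition of symbols on the right-hand side.
        symbols = []
        while peek() is not None and peek() not in "|)":
            symbols.extend(parse_repeat())
        return symbols

    def parse_repeat():
        # repeat := (literal | '(' alt ')') '*'*
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            body = parse_alt()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
        elif ch == "*":
            raise ValueError("nothing to repeat")
        else:
            pos += 1
            body = [ch]
        while peek() == "*":
            pos += 1
            name = f"X{next(star_ids)}"
            rules.append((name, body + [name]))  # X -> x X
            rules.append((name, []))             # X -> ε
            body = [name]
        return body

    start = parse_alt()
    if pos != len(regex):
        raise ValueError("unbalanced ')'")
    rules.insert(0, ("S", start))
    return rules


def format_rules(rules):
    return [f"{lhs} -> {' '.join(rhs) if rhs else 'ε'}" for lhs, rhs in rules]


for line in format_rules(regex_to_cfg("a(b|cd*(e|f)*)*")):
    print(line)
```

Because this sketch names nonterminals while parsing, inner expressions are numbered before outer ones. For the regex used here, the outermost repetition therefore ends up as X3 instead of X1, but the grammar is the same up to renaming.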

One example: a(b|cd*(e|f)*)*

S -> a(b|cd*(e|f)*)*

S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε

... and a few more of those steps, until you end up with:

S  -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f
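One way to gain some confidence in a grammar like this is to compare it against the regex on all short strings. The sketch below hard-codes the rules above and uses a naive leftmost-expansion recognizer (the helper name `derives` is invented for this example). It only terminates for grammars without left recursion, which holds for the rules above; it is a sanity check over a bounded set of strings, not a proof of equivalence.

```python
import re
from functools import lru_cache
from itertools import product

# The final grammar from above; () stands for ε.
GRAMMAR = {
    "S":  [("a", "X1")],
    "X1": [("Y1", "X1"), ()],
    "Y1": [("b",), ("c", "X2", "X3")],
    "X2": [("d", "X2"), ()],
    "X3": [("Y2", "X3"), ()],
    "Y2": [("e",), ("f",)],
}


@lru_cache(maxsize=None)
def derives(symbols, text):
    """True if the symbol sequence can derive exactly `text`."""
    if not symbols:
        return text == ""
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Nonterminal: try every rule for it.
        return any(derives(rhs + rest, text) for rhs in GRAMMAR[head])
    # Terminal: it must match the next character(s) of the text.
    return text.startswith(head) and derives(rest, text[len(head):])


pattern = re.compile(r"a(b|cd*(e|f)*)*")

# Collect every string over {a..f} of length 0..5 on which the two disagree.
mismatches = []
for n in range(6):
    for chars in product("abcdef", repeat=n):
        s = "".join(chars)
        if derives(("S",), s) != bool(pattern.fullmatch(s)):
            mismatches.append(s)

print(mismatches)
```

An empty list means the grammar and the regex agree on every string tested.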

Source: https://stackoverflow.com/questions/13139821/algorithm-to-generate-context-free-grammar-from-any-regex

Tags: regex, algorithm, nlp, context-free-grammar, computation-theory