CFG for Python-style tuples

Question

After reading, for the zillionth time, a question about "How do I parse HTML with regex" on Stack Overflow, I got interested in grammars again, grabbed my university lecture notes, and after a few minutes wondered how I ever passed my exams.

As a simple exercise (well, I expected it to be simple), I tried to write a CFG that produces valid Python tuples (for simplicity's sake using only the identifiers a, b and c). After a good while, I came up with this:

G = ( {Tuple, TupleItem, Id}, {“a”, “b”, “c”, “,”, “(“, “)”}, P, Tuple)

where P is:

Tuple → “(“ TupleItem “)”
Tuple → “(“ TupleItem Id “)”
Tuple → “(“ TupleItem Tuple “)”
TupleItem → TupleItem TupleItem
TupleItem → Id “,”
TupleItem → Tuple “,”
Id → “a”
Id → “b”
Id → “c”

This grammar is supposed to produce e.g. (a,), (a,b), (a,b,), ((a,),), ((a,b,),(a,),), but not (,a), (), ,, (a,b c) etc. I do not want to produce superfluous parentheses like ((a),) or ((a,b)). Actually the sometimes optional (when more than one item) and sometimes obligatory (when only one item) trailing comma almost killed me.

Does this grammar produce all valid python tuples (using only a, b and c)?
Does this grammar produce strings that are not valid python tuples?
Is this grammar proper? (I am unsure about the cyclic criterion)
Why is my grammar so freaking long? How can I reduce the number of production rules? (Not by using syntactic sugar like pipes, as those only put several rules onto one line.)

Thanks in advance for your comments and answers.

Answer 1:

Without actually referring to the Python grammar, I'm pretty sure that your grammar produces all valid Python tuples except one, the empty tuple (), and that it doesn't produce anything which is not a Python tuple. So to that extent, it's fine.
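One way to spot-check this claim, as a rough sketch, is to ask CPython's own parser through the standard ast module. The helper name is_python_tuple is made up for illustration. Python itself accepts more than the grammar is meant to (for example ((a),) and the unparenthesized a,b are also tuples), so this only tests the "nothing invalid" direction.

```python
import ast


def is_python_tuple(text: str) -> bool:
    """Return True if CPython parses `text` as a tuple expression."""
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return False
    return isinstance(node, ast.Tuple)


# Strings the question expects the grammar to produce.
for s in ["(a,)", "(a,b)", "(a,b,)", "((a,),)", "((a,b,),(a,),)"]:
    print(s, is_python_tuple(s))

# Strings the question expects the grammar to reject.
for s in ["(,a)", ",", "(a,b c)", "(a)"]:
    print(s, is_python_tuple(s))
```

Every string in the first loop should be reported as a tuple and every string in the second as not one. The empty tuple () is also accepted by Python, which is the one case the original grammar misses.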

However, it's not much use for parsing because

TupleItem → TupleItem TupleItem

is exponentially ambiguous. (By the way, TupleItem is not a very descriptive name for this non-terminal, which is really a list.) Ambiguous grammars are "proper" in the sense that they obey all the rules for context-free grammars, but unambiguous grammars are usually better.
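To get a feel for how fast the ambiguity grows, here is a small sketch that counts the distinct parse trees for a run of n items under that rule. It assumes each item is a single leaf (an Id or an already-parsed Tuple followed by its comma); the function name count_parses is just for illustration. The counts are the Catalan numbers, which grow roughly like 4^n.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(n: int) -> int:
    """Number of parse trees deriving a run of n items from TupleItem."""
    if n == 1:
        # TupleItem -> Id ","  or  TupleItem -> Tuple ",": a single leaf.
        return 1
    # TupleItem -> TupleItem TupleItem: the run can be split at any position.
    return sum(count_parses(k) * count_parses(n - k) for k in range(1, n))


for n in (1, 2, 3, 4, 5, 10):
    print(n, count_parses(n))
```

For ten items this already gives 4862 different trees for the same string.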

It's easy to fix:

Tuple → “(“ “)”
Tuple → “(“ ItemList “,” “)”
Tuple → “(“ ItemList “,” Item “)”
ItemList → Item
ItemList → ItemList “,” Item
Item → Id
Item → Tuple

(I left out the Id productions; in practical grammars, Id would be a terminal, but it makes little difference.)
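As an illustration only, the unambiguous grammar above can be turned into a small hand-written recognizer. This is a sketch, not a generated parser: the function names are made up, whitespace is not allowed (it is not in the terminal alphabet), and the left-recursive ItemList rule is handled with a loop instead of recursion.

```python
def is_tuple(text: str) -> bool:
    """Return True if `text` is derivable from Tuple in the grammar above."""
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def parse_item() -> bool:
        nonlocal pos
        if peek() in ("a", "b", "c"):  # Item -> Id
            pos += 1
            return True
        return parse_tuple()  # Item -> Tuple

    def parse_tuple() -> bool:
        nonlocal pos
        if peek() != "(":
            return False
        pos += 1
        if peek() == ")":  # Tuple -> "(" ")"
            pos += 1
            return True
        if not parse_item():  # ItemList -> Item
            return False
        if peek() != ",":  # the comma after the first item is mandatory
            return False
        pos += 1
        # Remaining items; only the final comma is optional.
        while peek() != ")":
            if not parse_item():
                return False
            if peek() == ",":
                pos += 1
            elif peek() != ")":
                return False
        pos += 1  # consume ")"
        return True

    return parse_tuple() and pos == len(text)


for s in ["()", "(a,)", "(a,b)", "(a,b,)", "((a,b,),(a,),)", "(a)", "((a),)", "(,a)", "(a,b c)"]:
    print(s, is_tuple(s))
```

The first five strings should be accepted and the last four rejected, which matches the examples listed in the question plus the empty tuple.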

Finally, why is this grammar "so long"? (Are seven productions really "so freaking long"? Depends on your criteria, I guess.)

The simple answer is that CFGs are like that. You could add syntactic sugar to make the right-hand sides regular expressions (not just alternation, but also Kleene star and its companions):

Tuple → “(“ [ ItemList “,” Item? ]? “)”
ItemList → Item // “,”
Item → Id | Tuple

Here I use the useful interpolate operator //, which is rarely taught in academic classes and consequently has surprisingly few implementations:

a // b =_def a(ba)^*
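For the regular case, that definition can be sketched with Python's re module. The helper interpolate is hypothetical and simply builds the regex source for a(ba)^*. Since regular expressions cannot match nested tuples, the example only uses it for a flat, comma-separated list of identifiers.

```python
import re


def interpolate(a: str, b: str) -> str:
    """Regex source for a // b, i.e. a(ba)*; a and b are regex fragments."""
    return f"(?:{a})(?:(?:{b})(?:{a}))*"


# Item // "," restricted to plain identifiers (no nested tuples).
id_list = re.compile(interpolate("[abc]", ","))

for s in ["a", "a,b,c", "a,b,", ",a", ""]:
    print(repr(s), bool(id_list.fullmatch(s)))
```

Only "a" and "a,b,c" should match; the separator is never allowed at the start or the end, which is why the tuple rule has to add the trailing comma itself.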

Whether or not the above is easier to read, I leave to the reader. It's similar to the EBNF (Extended Backus-Naur Form) commonly used in grammar expositions, particularly in RFCs. (EBNF is one of the few formalisms with an interpolate operator, although it's not written as explicitly as mine.)

Anyway, other than that, I don't believe that your grammar can be trimmed.

Source: https://stackoverflow.com/questions/18797248/cfg-for-python-style-tuples

Tags

python

tuples

grammar

context-free-grammar