Thinking about parsing regular expressions using yacc (I'm actually using PLY), some of the rules would be like the following:
expr : expr expr
expr : expr
You are under no obligation to use precedence to disambiguate; you can simply write an unambiguous grammar:
term : CHAR | '(' expr ')'
rept : term | term '*' | term '+' | term '?'
conc : rept | conc rept
expr : conc | expr '|' conc
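To see how the layering alone encodes precedence, here is a minimal hand-written recursive-descent sketch that mirrors the four rules above. It is not PLY and not a yacc-generated parser; the function name `parse`, the nested-tuple tree shape, and the node labels (`char`, `star`, `plus`, `opt`, `cat`, `alt`) are all assumptions made for illustration. The left-recursive rules are rewritten as loops, which keeps the same left-associative trees.

```python
def parse(pattern):
    """Parse a regex string into nested tuples, one function per grammar layer."""
    pos = 0

    def peek():
        # Current character, or None at end of input.
        return pattern[pos] if pos < len(pattern) else None

    def term():
        # term : CHAR | '(' expr ')'
        nonlocal pos
        ch = peek()
        if ch == '(':
            pos += 1
            node = expr()
            if peek() != ')':
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1
            return node
        if ch is None or ch in '|)*+?':
            raise SyntaxError(f"unexpected {ch!r} at position {pos}")
        pos += 1
        return ('char', ch)

    def rept():
        # rept : term | term '*' | term '+' | term '?'
        nonlocal pos
        node = term()
        if peek() in ('*', '+', '?'):
            node = ({'*': 'star', '+': 'plus', '?': 'opt'}[peek()], node)
            pos += 1
        return node

    def conc():
        # conc : rept | conc rept   (left recursion written as a loop)
        node = rept()
        while peek() is not None and peek() not in '|)':
            node = ('cat', node, rept())
        return node

    def expr():
        # expr : conc | expr '|' conc   (left recursion written as a loop)
        nonlocal pos
        node = conc()
        while peek() == '|':
            pos += 1
            node = ('alt', node, conc())
        return node

    tree = expr()
    if pos != len(pattern):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree
```

Because repetition sits below concatenation, which sits below alternation, `parse('ab|c*')` groups as `(ab)|(c*)` without any precedence declarations.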
If you really want to use precedence, you can use a "fictitious" token with a %prec annotation. See the manual for details. (This feature comes from yacc, so you can also read about it in any yacc/bison documentation.)
Bear in mind that precedence is always a comparison between a production (at the top of the parser stack) and the lookahead token. Normally, the precedence of productions is taken from the precedence of the last terminal in the production (and normally there is only one terminal in each applicable production), so it appears to be a comparison between terminals. But in order to get precedence to work with "invisible" operators, you need to separately consider both the production precedence and the lookahead token precedence.
The precedence of the production can be set with a "fictitious" token, as described above. But there is no lookahead token corresponding to an invisible operator; the lookahead token will be the first token of the following operand. In other words, it could be any token in the FIRST set of expr, which in this case is {NORMAL, PLEFT}. The tokens in this set must be added to the precedence declaration as though they were concatenation operators:
precedence = (
    ('left', 'BAR'),
    ('left', 'CONCAT', 'NORMAL', 'PLEFT'),
    ('left', 'ASTERISK'),
)
Once you do that, you could economize on the fictitious CONCAT token, since you could use any of the FIRST(expr) tokens as a proxy, but that might be considered less readable.
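Putting the pieces together, a PLY module along these lines might look like the sketch below. The token names and the precedence table come from the discussion above; everything else is an assumption for illustration: the lexer regexes, the tuple-shaped parse trees, the exact set of productions (the original grammar is only partly shown), and the hypothetical `build_parser()` helper. PLY is imported only inside that helper, so the definitions can be read and checked without it installed.

```python
# Sketch only: regexes, tree shapes and build_parser() are illustrative assumptions.
tokens = ('NORMAL', 'PLEFT', 'PRIGHT', 'BAR', 'ASTERISK')  # CONCAT is never lexed

t_BAR = r'\|'
t_ASTERISK = r'\*'
t_PLEFT = r'\('
t_PRIGHT = r'\)'
t_NORMAL = r'[^|*()]'   # any character that is not an operator or parenthesis

def t_error(t):
    raise SyntaxError(f"illegal character {t.value[0]!r}")

# Lowest precedence first. NORMAL and PLEFT share CONCAT's level because they
# are the lookahead tokens seen where an invisible concatenation would occur.
precedence = (
    ('left', 'BAR'),
    ('left', 'CONCAT', 'NORMAL', 'PLEFT'),
    ('left', 'ASTERISK'),
)

def p_expr_alt(p):
    'expr : expr BAR expr'
    p[0] = ('alt', p[1], p[3])

def p_expr_concat(p):
    'expr : expr expr %prec CONCAT'
    # %prec gives this production the precedence of the fictitious CONCAT token.
    p[0] = ('cat', p[1], p[2])

def p_expr_star(p):
    'expr : expr ASTERISK'
    p[0] = ('star', p[1])

def p_expr_group(p):
    'expr : PLEFT expr PRIGHT'
    p[0] = p[2]

def p_expr_normal(p):
    'expr : NORMAL'
    p[0] = ('char', p[1])

def p_error(p):
    raise SyntaxError(f"syntax error at {p!r}")

def build_parser():
    # Hypothetical helper: requires the third-party PLY package.
    import ply.lex as lex
    import ply.yacc as yacc
    return lex.lex(), yacc.yacc(debug=False)
```

With PLY available, the parser would be built with `lexer, parser = build_parser()` and run with `parser.parse(text, lexer=lexer)`. Dropping CONCAT and writing `%prec NORMAL` instead should behave the same way, at the cost of the readability mentioned above.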