Question
Given the following language described as:
- formally:
(identifier operator identifier+)*
- in plain English: zero or more operations written as an identifier (the lvalue), then an operator, then one or more identifiers (the rvalue)
An example of a sequence of operations in that language, given the arbitrary operator @, would be:
A @ B C X @ Y
Whitespace is not significant and it may also be written more clearly as:
A @ B C
X @ Y
How would you parse this with a yacc-like LALR parser?
What I tried so far
I know how to parse explicitly delimited operations, say A @ B C ; X @ Y
but I would like to know whether parsing the above input is feasible, and how. Below is a (non-functional) minimal example using Flex/Bison.
lex.l:
%{
#include "y.tab.h"
%}
%option noyywrap
%option yylineno
%%
[a-zA-Z][a-zA-Z0-9_]* { return ID; }
@ { return OP; }
[ \t\r\n]+ ; /* ignore whitespace */
. { return ERROR; } /* any other character causes parse error */
%%
yacc.y:
%{
#include <stdio.h>
extern int yylineno;
void yyerror(const char *str);
int yylex();
%}
%define parse.lac full
%define parse.error verbose
%token ID OP ERROR
%left OP
%start opdefs
%%
opright:
| opright ID
;
opdef: ID OP ID opright
;
opdefs:
| opdefs opdef
;
%%
void yyerror(const char *str) {
fprintf(stderr, "error@%d: %s\n", yylineno, str);
}
int main(int argc, char *argv[]) {
yyparse();
}
Build with: $ flex lex.l && yacc -d yacc.y --report=all --verbose && gcc lex.yy.c y.tab.c
The issue: I cannot stop the parser from including the next lvalue identifier in the rvalue of the first operation.
$ ./a.out
A @ B C X @ Y
error@1: syntax error, unexpected OP, expecting $end or ID
The above is always parsed as: reduce(A @ B reduce(C X)) @ Y
I get the feeling I have to somehow put a condition on the lookahead token that says that if it is the operator, the last identifier should not be shifted and the current stack should be reduced:
A @ B C X @ Y
        ^ *   // ^: current, *: lookahead
-> reduce 'A @ B C' !
-> shift 'X' !
I tried all kinds of operator precedence arrangements but cannot get it to work.
I would be willing to accept a solution that does not apply to Bison as well.
Answer 1:
A naïve grammar for that language is LALR(2), and bison does not generate LALR(2) parsers.
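To see why one token of lookahead is not enough: after A @ B, when the parser sees an identifier it cannot tell whether that identifier extends the current rvalue or starts the next operation until it also knows whether the following token is the operator. A hand-written recognizer that peeks two tokens ahead makes the decision explicit. This is only a sketch; the Tok enum and the count_ops function are names invented here for illustration and are not part of the question's code.

```c
#include <stddef.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum { TOK_ID, TOK_OP } Tok;

/*
 * Recognize (ID OP ID+)* over an array of n tokens.
 * Returns the number of operations found, or -1 on a syntax error.
 */
int count_ops(const Tok *toks, size_t n) {
    size_t i = 0;
    int ops = 0;
    while (i < n) {
        /* Every operation must start with: ID OP ID */
        if (i + 2 >= n || toks[i] != TOK_ID || toks[i + 1] != TOK_OP ||
            toks[i + 2] != TOK_ID)
            return -1;
        i += 3;
        /* Consume further rvalue IDs. Stop when an ID is followed by OP:
           that ID is the lvalue of the next operation. This test on
           toks[i + 1] is the second token of lookahead. */
        while (i < n && toks[i] == TOK_ID &&
               !(i + 1 < n && toks[i + 1] == TOK_OP))
            i++;
        ops++;
    }
    return ops;
}
```

For the token sequence of A @ B C X @ Y (ID OP ID ID ID OP ID), count_ops finds two operations, splitting before X.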
Any LALR(2) grammar can be mechanically modified to produce an LALR(1) grammar with a compatible parse tree, but I don't know of any automatic tool which does that.
It's possible but annoying to do the transformation by hand, but be aware that you will need to adjust the actions in order to recover the correct parse tree:
%{
/* Helper functions such as Def_new and IdList_push are assumed to be defined elsewhere. */
typedef struct IdList { char* id; struct IdList* next; } IdList;
typedef struct Def { char* lhs; IdList* rhs; } Def;
typedef struct DefList { Def* def; struct DefList* next; } DefList;
%}
%union {
  Def* def;
  DefList* defs;
  char* id;
}
%type <def> ophead
%type <defs> opdefs
%token <id> ID
%%
prog  : opdefs        { $1->def->rhs = IdList_reverse($1->def->rhs);
                        DefList_show(DefList_reverse($1)); }
ophead: ID '@' ID     { $$ = Def_new($1);
                        $$->rhs = IdList_push($$->rhs, $3); }
opdefs: ophead        { $$ = DefList_push(NULL, $1); }
      | opdefs ID     { $1->def->rhs = IdList_push($1->def->rhs, $2);
                        $$ = $1; }
      | opdefs ophead { $1->def->rhs = IdList_reverse($1->def->rhs);
                        $$ = DefList_push($1, $2); }
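The actions above call helper functions (Def_new, IdList_push, IdList_reverse, DefList_push, DefList_reverse, DefList_show) that the answer does not show. A minimal sketch follows, assuming both lists are singly linked, that push prepends (which would explain why each list is reversed once it is complete), and that a failed allocation simply aborts. The signatures are inferred from how the actions use them, and xmalloc is a name made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct IdList { char *id; struct IdList *next; } IdList;
typedef struct Def { char *lhs; IdList *rhs; } Def;
typedef struct DefList { Def *def; struct DefList *next; } DefList;

/* Hypothetical helper: allocate or abort, to keep the sketch short. */
static void *xmalloc(size_t size) {
    void *p = malloc(size);
    if (!p) { perror("malloc"); exit(EXIT_FAILURE); }
    return p;
}

/* Create a definition with the given lvalue and an empty rvalue list. */
Def *Def_new(char *lhs) {
    Def *def = xmalloc(sizeof *def);
    def->lhs = lhs;
    def->rhs = NULL;
    return def;
}

/* Prepend id to list and return the new head. */
IdList *IdList_push(IdList *list, char *id) {
    IdList *node = xmalloc(sizeof *node);
    node->id = id;
    node->next = list;
    return node;
}

/* Reverse list in place and return the new head. */
IdList *IdList_reverse(IdList *list) {
    IdList *prev = NULL;
    while (list) {
        IdList *next = list->next;
        list->next = prev;
        prev = list;
        list = next;
    }
    return prev;
}

/* Prepend def to list and return the new head. */
DefList *DefList_push(DefList *list, Def *def) {
    DefList *node = xmalloc(sizeof *node);
    node->def = def;
    node->next = list;
    return node;
}

/* Reverse list in place and return the new head. */
DefList *DefList_reverse(DefList *list) {
    DefList *prev = NULL;
    while (list) {
        DefList *next = list->next;
        list->next = prev;
        prev = list;
        list = next;
    }
    return prev;
}

/* Print each definition on its own line as "lhs @ id id ...". */
void DefList_show(const DefList *list) {
    for (; list; list = list->next) {
        printf("%s @", list->def->lhs);
        for (const IdList *id = list->def->rhs; id; id = id->next)
            printf(" %s", id->id);
        putchar('\n');
    }
}
```

Prepending keeps each push O(1); the cost is the single reversal per list that the grammar actions perform when a definition, or the whole program, is complete.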
This precise problem is, ironically, part of bison itself, because productions do not require a ; terminator. Bison uses itself to generate a parser, and it solves this problem in the lexer rather than jumping through hoops as outlined above. In the lexer, once an ID is found, the scan continues up to the next non-whitespace character. If that is a :, then the lexer returns an identifier-definition token; otherwise, the non-whitespace character is returned to the input stream, and an ordinary identifier token is returned.
Here's one way of implementing that in the lexer:
%x SEEK_AT
%%
  /* See below for explanation, if needed */
  static int deferred_eof = 0;
  if (deferred_eof) { deferred_eof = 0; return 0; }
[[:alpha:]][[:alnum:]_]*  yylval.id = strdup(yytext); BEGIN(SEEK_AT);
[[:space:]]+              ; /* ignore whitespace */
  /* Could be other rules here */
.                         return *yytext; /* Let the parser handle errors */
<SEEK_AT>{
  [[:space:]]+            ; /* ignore whitespace */
  "@"                     BEGIN(INITIAL); return ID_AT;
  .                       BEGIN(INITIAL); yyless(0); return ID;
  <<EOF>>                 BEGIN(INITIAL); deferred_eof = 1; return ID;
}
In the SEEK_AT start condition, we're only interested in @. If we find one, then the ID was the start of a def, and we return the correct token type. If we find anything else (other than whitespace), we return the character to the input stream using yyless, and return the ID token type. Note that yylval was already set from the initial scan of the ID, so there is no need to worry about it here.
The only complicated bit of the above code is the EOF handling. Once an EOF has been detected, it is not possible to reinsert it into the input stream, with either yyless or unput. Nor is it legal to let the scanner read the EOF again. So it needs to be fully dealt with. Unfortunately, in the SEEK_AT start condition, fully dealing with EOF requires sending two tokens: first the already detected ID token, and then the 0 which yyparse will recognize as end of input. Without a push-parser, we cannot send two tokens from a single scanner action, so we need to register the fact of having received an EOF, and check for that on the next call to the scanner.
Indented code before the first rule is inserted at the top of the yylex function, so it can declare local variables and do whatever needs to be done before the scan starts. As written, this lexer is not re-entrant, but it is restartable because the persistent state is reset in the if (deferred_eof) action. To make it re-entrant, you'd only need to keep deferred_eof in the scanner's per-instance state (for example, in the yyextra data of a reentrant flex scanner) instead of making it a static local.
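To illustrate the two ideas discussed above (looking ahead for @ before deciding the token type, and keeping deferred_eof in per-scanner state rather than in a static local), here is a hand-written scanner in plain C. It is a sketch and not flex output: the Scanner struct, the scan_next function, and the TOK_* constants are names invented for this example, and it reads from an in-memory string rather than from a stream.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 0 means end of input, as yyparse expects. */
enum { TOK_EOF = 0, TOK_ID = 256, TOK_ID_AT = 257 };

/* All scanner state lives here, so several scanners can coexist. */
typedef struct Scanner {
    const char *cur;   /* next unread character of a NUL-terminated input */
    int deferred_eof;  /* set when end of input was hit while seeking '@' */
    const char *id;    /* start of the most recently scanned identifier */
    size_t id_len;     /* length of that identifier */
} Scanner;

int scan_next(Scanner *s) {
    /* Deliver the end-of-input token postponed by the previous call. */
    if (s->deferred_eof) { s->deferred_eof = 0; return TOK_EOF; }

    while (isspace((unsigned char)*s->cur)) s->cur++;
    if (*s->cur == '\0') return TOK_EOF;

    /* Anything that is not an identifier is returned as itself,
       leaving error handling to the parser. */
    if (!isalpha((unsigned char)*s->cur))
        return (unsigned char)*s->cur++;

    s->id = s->cur;
    while (isalnum((unsigned char)*s->cur) || *s->cur == '_') s->cur++;
    s->id_len = (size_t)(s->cur - s->id);

    /* Look ahead past whitespace: an '@' means this identifier is an lvalue. */
    const char *look = s->cur;
    while (isspace((unsigned char)*look)) look++;
    if (*look == '@') { s->cur = look + 1; return TOK_ID_AT; }

    /* A string scanner could simply re-read the terminator; the flag is
       kept only to mirror the flex logic, where EOF cannot be pushed back. */
    if (*look == '\0') { s->cur = look; s->deferred_eof = 1; }
    return TOK_ID;
}
```

On the input A @ B C X @ Y, successive calls return TOK_ID_AT, TOK_ID, TOK_ID, TOK_ID_AT, TOK_ID and then TOK_EOF. A grammar consuming these tokens would start each operation with the combined ID_AT token instead of ID followed by the operator, so one token of lookahead suffices.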
Answer 2:
Following rici's useful comment and answer, here is what I came up with:
lex.l:
%{
#include <string.h> /* strdup */
#include "y.tab.h"
%}
%option noyywrap
%option yylineno
%%
[a-zA-Z][a-zA-Z0-9_]* { yylval.a = strdup(yytext); return ID; }
@ { return OP; }
[ \t\r\n]+ ; /* ignore whitespace */
. { return ERROR; } /* any other character causes parse error */
%%
yacc.y:
%{
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
extern int yylineno;
void yyerror(const char *str);
int yylex();
#define STR_OP " @ "
#define STR_SPACE " "
char *concat3(const char *, const char *, const char *);
struct oplist {
char **ops;
size_t capacity, count;
} my_oplist = { NULL, 0, 0 };
int oplist_append(struct oplist *, char *);
void oplist_clear(struct oplist *);
void oplist_dump(struct oplist *);
%}
%union {
char *a;
}
%define parse.lac full
%define parse.error verbose
%token ID OP END ERROR
%start input
%%
opbase: ID OP ID {
char *s = concat3($<a>1, STR_OP, $<a>3);
free($<a>1);
free($<a>3);
assert(s && "opbase: allocation failed");
$<a>$ = s;
}
;
ops: opbase {
$<a>$ = $<a>1;
}
| ops opbase {
int r = oplist_append(&my_oplist, $<a>1);
assert(r == 0 && "ops: allocation failed");
$<a>$ = $<a>2;
}
| ops ID {
char *s = concat3($<a>1, STR_SPACE, $<a>2);
free($<a>1);
free($<a>2);
assert(s && "ops: allocation failed");
$<a>$ = s;
}
;
input: ops {
int r = oplist_append(&my_oplist, $<a>1);
assert(r == 0 && "input: allocation failed");
}
;
%%
char *concat3(const char *s1, const char *s2, const char *s3) {
size_t len = strlen(s1) + strlen(s2) + strlen(s3);
char *s = malloc(len + 1);
if (!s)
goto concat3__end;
sprintf(s, "%s%s%s", s1, s2, s3);
concat3__end:
return s;
}
int oplist_append(struct oplist *oplist, char *op) {
if (oplist->count == oplist->capacity) {
char **ops = realloc(oplist->ops, (oplist->capacity + 32) * sizeof(char *));
if (!ops)
return 1;
oplist->ops = ops;
oplist->capacity += 32;
}
oplist->ops[oplist->count++] = op;
return 0;
}
void oplist_clear(struct oplist *oplist) {
if (oplist->count > 0) {
for (size_t i = 0; i < oplist->count; ++i)
free(oplist->ops[i]);
oplist->count = 0;
}
if (oplist->capacity > 0) {
free(oplist->ops);
oplist->capacity = 0;
}
}
void oplist_dump(struct oplist *oplist) {
for (size_t i = 0; i < oplist->count; ++i)
printf("%2zu: '%s'\n", i, oplist->ops[i]);
}
void yyerror(const char *str) {
fprintf(stderr, "error@%d: %s\n", yylineno, str);
}
int main(int argc, char *argv[]) {
yyparse();
oplist_dump(&my_oplist);
oplist_clear(&my_oplist);
}
Output with A @ B C X @ Y:
0: 'A @ B C'
1: 'X @ Y'
Source: https://stackoverflow.com/questions/35431147/how-to-reduce-parser-stack-or-unshift-the-current-token-depending-on-what-foll