Question
I'm not sure whether this grammar is correct for a shell command language that should also handle single-quoted and double-quoted arguments. Non-trivial commands seem to work, e.g. ls -al | sort | wc -l, but a simple command with single quotes, such as echo 'foo bar', does not. The flex scanner is:
%{
#include "shellparser.h"
%}
%option reentrant
%option noyywrap
%x SINGLE_QUOTED
%x DOUBLE_QUOTED
%%
"|" { return PIPE; }
[ \t\r] { }
[\n] { return EOL; }
[a-zA-Z0-9_\.\-]+ { return FILENAME; }
['] { BEGIN(SINGLE_QUOTED); }
<SINGLE_QUOTED>[^']+ { }
<SINGLE_QUOTED>['] { BEGIN(INITIAL); return ARGUMENT; }
<SINGLE_QUOTED><<EOF>> { return -1; }
["] { BEGIN(DOUBLE_QUOTED); }
<DOUBLE_QUOTED>[^"]+ { }
<DOUBLE_QUOTED>["] { BEGIN(INITIAL); return ARGUMENT; }
<DOUBLE_QUOTED><<EOF>> { return -1; }
[^ \t\r\n|'"]+ { return ARGUMENT; }
%%
The code that drives the scanner and the parser is:
params[0] = NULL;
printf("> ");
i=1;
do {
lexCode = yylex(scanner);
text = strdup(yyget_text(scanner));//yyget_text(scanner);
/*printf("lexCode %d command %s inc:%d", lexCode, text, i);*/
ca = text;
if (lexCode != EOL) {
params[i++] = text;
}
Parse(shellParser, lexCode, text);
if (lexCode == EOL) {
dump_argv("Before exec_arguments", i, params);
exec_arguments(i, params);
corpse_collector();
Parse(shellParser, 0, NULL);
i=1;
}
} while (lexCode > 0);
if (-1 == lexCode) {
fprintf(stderr, "The scanner encountered an error.\n");
}
The CMake build file is
cmake_minimum_required(VERSION 3.0)
project(openshell)
find_package(FLEX)
FLEX_TARGET(ShellScanner shellscanner.l shellscanner.c)
set(CMAKE_VERBOSE_MAKEFILE on)
include_directories(/usr/include/readline)
ADD_EXECUTABLE(lemon lemon.c)
add_custom_command(OUTPUT shellparser.c COMMAND lemon -s shellparser.y DEPENDS shellparser.y)
add_executable(openshell shellparser.c ${FLEX_ShellScanner_OUTPUTS} main.c openshell.h errors.c errors.h util.c util.h stack.c stack.h shellscanner.l shellscanner.h)
file(GLOB SOURCES "./*.c")
target_link_libraries(openshell ${READLINE_LIBRARY} ${FLEX_LIBRARIES})
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall -O3 -std=c99")
My project is available on my GitHub. A typical shell session, in which only some commands work because of the bug, looks like this:
> ls -al | sort | wc
argument ::= FILENAME .
argumentList ::= argument .
command ::= FILENAME argumentList .
command ::= FILENAME .
command ::= FILENAME .
commandList ::= command .
commandList ::= command PIPE commandList .
commandList ::= command PIPE commandList .
{(null)} {ls} {-al} {|} {sort} {|} {wc}
45 398 2270
3874: child 3881 status 0x0000
in ::= in commandList EOL .
> who
command ::= FILENAME .
commandList ::= command .
{(null)} {who}
dac :0 2016-04-18 05:17 (:0)
dac pts/2 2016-04-18 05:20 (:0)
3874: child 3887 status 0x0000
in ::= in commandList EOL .
> ls -al | awk '{print $1}'
argument ::= FILENAME .
argumentList ::= argument .
command ::= FILENAME argumentList .
argument ::= ARGUMENT .
argumentList ::= argument .
command ::= FILENAME argumentList .
commandList ::= command .
commandList ::= command PIPE commandList .
{(null)} {ls} {-al} {|} {awk} {'}
awk: cmd. line:1: '
awk: cmd. line:1: ^ invalid char ''' in expression
3874: child 3896 status 0x0100
in ::= in commandList EOL .
>
Both quoted commands hit the same bug: echo 'foo bar' gets garbled to {echo} {'} when it should result in {echo} {foo bar}, so that the shell strips the quotes and executes the command like this:
char *cmd[] = { "/usr/bin/echo", "foo bar", 0 };
Answer 1:
The problem is in the rule
<SINGLE_QUOTED>[^']+ { }
since its empty action discards every character inside the quotes. All you get as yytext is the closing quote (from the rule <SINGLE_QUOTED>['] ...). You have to store the text somewhere and use it when the closing quote is detected. For example (very poor coding style, error checking etc. omitted, sorry):
<SINGLE_QUOTED>[^']+ { mystring = strdup(yytext); }
<SINGLE_QUOTED>['] { BEGIN(INITIAL);
/* mystring contains the whole string now,
yytext contains only "'" */
return ARGUMENT; }
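A slightly more defensive variant of the same idea is sketched below. The strbuf type and its functions are hypothetical helpers, not part of flex or of the asker's project. The buffer is reset when the opening quote is seen, so an empty string such as '' does not return stale text, and the caller would read the buffer instead of yytext when ARGUMENT comes back. In a reentrant scanner the buffer would more naturally live in the yyextra data than in a global; that is left out here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer for the body of a quoted string.
 * Intended use from the scanner actions (assumption, untested there):
 *   [']                   { strbuf_reset(&buf); BEGIN(SINGLE_QUOTED); }
 *   <SINGLE_QUOTED>[^']+  { strbuf_append(&buf, yytext, yyleng); }
 *   <SINGLE_QUOTED>[']    { BEGIN(INITIAL); return ARGUMENT; }
 */
typedef struct {
    char  *data;  /* NUL-terminated contents, or NULL before first use */
    size_t len;   /* bytes stored, not counting the terminator */
    size_t cap;   /* bytes allocated */
} strbuf;

/* Empty the buffer but keep its storage. */
void strbuf_reset(strbuf *b)
{
    assert(b != NULL);
    b->len = 0;
    if (b->data != NULL)
        b->data[0] = '\0';
}

/* Append n bytes of s. Returns 0 on success, -1 if allocation fails. */
int strbuf_append(strbuf *b, const char *s, size_t n)
{
    assert(b != NULL && s != NULL);
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 16;
        while (cap < b->len + n + 1)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Return a malloc'd copy of the contents ("" if nothing was appended). */
char *strbuf_dup(const strbuf *b)
{
    assert(b != NULL);
    const char *src = b->data ? b->data : "";
    char *copy = malloc(strlen(src) + 1);
    if (copy != NULL)
        strcpy(copy, src);
    return copy;
}
```

In the main loop, the ARGUMENT case would then take its text from strbuf_dup(&buf) rather than from strdup(yyget_text(scanner)).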
Answer 2:
yytext holds a pointer to the substring which matched the most recently recognized pattern.
So when your scanner returns ARGUMENT at the end of a single-quoted string, yytext points to the terminating single quote. As it happens, that is visible in your debugging trace.
If you want to "build up" a token, you should take a look at the flex function yymore(). (And don't forget that the closing single quote is not part of the quoted string.)
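As a rough sketch of that approach (a fragment only, not tested against the asker's project; it reuses the start condition and token names from the question), the single-quote rules could look something like this. Overwriting the closing quote with a NUL is just one possible way to drop it; flex allows an action to modify yytext as long as it does not lengthen it.

```lex
[']                    { BEGIN(SINGLE_QUOTED); }
<SINGLE_QUOTED>[^']+   { yymore(); /* keep this text; the next match is appended to it */ }
<SINGLE_QUOTED>[']     { /* yytext is now the quoted body followed by the closing quote */
                         yytext[yyleng - 1] = '\0';  /* drop the closing quote */
                         BEGIN(INITIAL);
                         return ARGUMENT; }
<SINGLE_QUOTED><<EOF>> { return -1; }
```

With rules along these lines, strdup(yyget_text(scanner)) in the main loop should see foo bar for the input 'foo bar'. Note that yyleng still counts the overwritten quote, so code that relies on the length would need to subtract one.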
Returning ARGUMENT for both single- and double-quoted strings is both misleading and imprecise.
It is imprecise because a double-quoted string is handled very differently than a single-quoted string, since enclosed substitution syntaxes are expanded, requiring a recursive call to the parser (and this needs to be done even to recognize the end of the string: consider "$(echo "Hello, world!")", as one simple example).
It is misleading because the end of the quoted segment does not mark the end of a word. Indeed, a simple-minded scanner will not correctly find word endings. Consider:
x="a b"
printf "[%s]\n" '$x'$x"$x"
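To make the word-boundary point concrete, here is a minimal C sketch (split_words and free_words are hypothetical names, not part of the asker's project). It only understands blanks and the two quote characters. Pipes, escapes, and the expansion a real shell performs for unquoted and double-quoted $x are deliberately left out. The point is only that a closing quote does not end the word: adjacent quoted and unquoted segments are concatenated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Free the first n words stored by split_words(). */
static void free_words(char **words, int n)
{
    for (int i = 0; i < n; i++)
        free(words[i]);
}

/* Split line into blank-separated words, removing quotes.
 * A word ends at an unquoted blank, never at a closing quote.
 * Stores malloc'd strings in out and returns their count, or -1 on an
 * unterminated quote, too many words, or allocation failure. */
int split_words(const char *line, char **out, int max_words)
{
    assert(line != NULL && out != NULL);
    int count = 0;
    const char *p = line;

    for (;;) {
        while (*p == ' ' || *p == '\t')      /* skip blanks between words */
            p++;
        if (*p == '\0')
            return count;

        /* The rest of the line is an upper bound on the word length. */
        char *word = malloc(strlen(p) + 1);
        if (word == NULL || count == max_words) {
            free(word);
            free_words(out, count);
            return -1;
        }

        size_t n = 0;
        while (*p != '\0' && *p != ' ' && *p != '\t') {
            if (*p == '\'' || *p == '"') {
                char quote = *p++;           /* remember which quote opened */
                while (*p != '\0' && *p != quote)
                    word[n++] = *p++;        /* copy the body verbatim */
                if (*p == '\0') {            /* unterminated quote */
                    free(word);
                    free_words(out, count);
                    return -1;
                }
                p++;                         /* skip the closing quote, stay in the word */
            } else {
                word[n++] = *p++;
            }
        }
        word[n] = '\0';
        out[count++] = word;
    }
}
```

With this sketch, echo 'foo bar' yields the two words echo and foo bar, and '$x'$x"$x" stays a single word. A real shell would additionally expand the unquoted and the double-quoted $x, which is where the recursive parsing mentioned above comes in.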
Finally, it is not clear to me why you chose to use lemon rather than bison/yacc since you are not using the one feature which would make it useful in this case: the fact that it implements a "push" interface, allowing you to call the parser from a lexer rule. Of course, modern bison versions -- and even not-so-modern ones -- also implement this feature. Not that I have any bias against lemon -- I think it could be an excellent match for this problem precisely because of the need to do recursive parsing.
Source: https://stackoverflow.com/questions/36696071/is-the-bug-in-the-grammar-or-in-the-code