Question
It's known that recursive descent parsers may require exponential time in some cases; could anyone point me to examples where this happens? I'm especially interested in cases for PEG (i.e. with prioritized choices).
Answer 1:
It's because you can end up parsing the same thing (checking the same rule at the same position) many times in different branches of the recursion. It's much like computing the n-th Fibonacci number with naive recursion.
Grammar:
A -> xA | xB | x
B -> yA | xA | y | A
S -> A
Input:
xxyxyy
Parsing:
xA(xxyxyy)
  xA(xyxyy)
    xA(yxyy) fail
    xB(yxyy) fail
    x(yxyy) fail
  xB(xyxyy)
    yA(yxyy)
      xA(xyy)
        xA(yy) fail
        xB(yy) fail
        x(yy) fail
      xB(xyy)
        yA(yy)
          xA(y) fail
          xB(y) fail
          x(y) fail
        xA(yy) fail *
      x(xyy) fail
    xA(yxyy) fail *
    y(yxyy) fail
    A(yxyy)
      xA(yxyy) fail *
      xB(yxyy) fail *
      x(yxyy) fail *
  x(xyxyy) fail
xB(xxyxyy)
  yA(xyxyy) fail
  xA(xyxyy) *
    xA(yxyy) fail *
    xB(yxyy) fail *
...
* marks a place where we parse a rule at a position where we have already parsed it in a different branch. If we had saved the results (which rules fail at which positions), we'd know xA(xyxyy) fails the second time around and we wouldn't go through its whole subtree again. I didn't want to write out the whole thing, but you can see it will repeat the same subtrees many times.
When does it happen? When you have many overlapping alternatives. Prioritized choice doesn't change anything: if the lowest-priority rule ends up being the only correct one (or none is correct), you had to check all the rules anyway.
Answer 2:
Any top-down parser, including recursive descent, can theoretically become exponential if the combination of input and grammar is such that large numbers of backtracks are necessary. This happens when the grammar places the determinative choice at the end of a long sequence. For example, suppose you have a symbol like & meaning "all previous minuses are actually plusses", and then have data like "((((a - b) - c) - d) - e &)": the parser has to go back and reinterpret all the minuses as plusses. If you start nesting expressions along these lines, you can create inputs that take effectively unbounded time to parse.
You have to realize you are stepping into a political issue here, because the reality is that most normal grammars and data sets are not like this. However, there are a LOT of people who systematically badmouth recursive descent because RD parsers are not easy to generate automatically. Nearly all early parser generators produced LALR parsers, because LALR is MUCH easier to generate automatically than RD. So what happened was that everyone just wrote LALR and badmouthed RD, because in the old days the only way to get an RD parser was to code it by hand. For example, if you read the dragon book you will find that Aho & Ullman devote just one paragraph to RD, and it is basically an ideological takedown saying "RD is bad, don't do it".
Of course, if you start hand-coding RD parsers (as I have) you will find that they are much better than LALR parsers for a variety of reasons. In the old days you could always tell a compiler with a hand-coded RD parser, because it gave meaningful error messages with accurate locations, whereas compilers with LALR parsers would report an error 50 lines away from where it really occurred. Things have changed a lot since then, but you should realize that when you start reading the FUD on RD, it is coming from a long, long tradition of verbally trashing RD in "certain circles".
Source: https://stackoverflow.com/questions/16869453/on-complexity-of-recursive-descent-parsers