complexity of parsing C++

Out of curiosity, I was wondering what were some "theoretical" results about parsing C++.

Let n be the size of my project (in LOC, for example, but since we'll deal with big-O it's not very important)

Is C++ parsed in O(n)? If not, what's the complexity?
Is C (or Java or any simpler language in the sense of its grammar) parsed in O(n)?
Will C++1x introduce new features that will be even harder to parse?

References would be greatly appreciated!

I think the term "parsing" is being interpreted by different people in different ways for the purposes of the question.

In a narrow technical sense, parsing is merely verifying that the source code matches the grammar (or perhaps even building a tree).

There's a rather widespread folk theorem that says you cannot parse C++ (in this sense) at all because you must resolve the meaning of certain symbols to parse. That folk theorem is simply wrong.

It arises from the use of "weak" (LALR or backtracking recursive descent) parsers, which must commit to one of several possible subparses of a locally ambiguous stretch of text (this SO thread discusses an example) and fail completely whenever they commit to the wrong one. Those who use such parsers resolve the dilemma by collecting symbol table data as parsing proceeds and mashing extra checks into the parsing process, so that the parser is forced to make the right choice whenever such a choice comes up. This works, at the cost of significantly tangling name and type resolution with parsing, which makes building such parsers really hard. But, at least for legacy GCC, they used LALR, which is linear time for parsing, and I don't think it gets much more expensive if you include the name/type capture that the parser does (there's more to do after parsing, because I don't think the parser does it all).
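To make the local ambiguity concrete, here is a minimal sketch. It is not taken from the linked thread, and the names Probe, as_value and as_type are made up for illustration. The same token sequence `a * b;` is an expression statement in one scope and a pointer declaration in another, and only name lookup tells the two apart:

```cpp
#include <cassert>

// Hypothetical helper type: multiplying two Probes has a visible side
// effect, so we can observe that `a * b;` really ran as an expression.
struct Probe { int hits = 0; };
inline int operator*(Probe& lhs, Probe&) { ++lhs.hits; return 0; }

namespace as_value {
    // Here `a` and `b` name objects, so `a * b;` is an expression statement.
    inline int demo() {
        Probe a, b;
        a * b;              // calls operator*(a, b) and discards the result
        return a.hits;      // 1 if the multiplication was evaluated
    }
}

namespace as_type {
    struct a {};            // here `a` names a type

    // Same tokens, but now `a * b;` declares `b` as a pointer to `a`.
    inline bool demo() {
        a * b;              // declaration: a* b;
        b = nullptr;        // only legal because `b` is a pointer
        return b == nullptr;
    }
}
```

Calling `as_value::demo()` and `as_type::demo()` from a `main` shows both readings compile side by side. A parser that refuses to look at declarations has to either keep both parses alive (the GLR route) or guess.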

There are at least two implementations of C++ parsers done using "pure" GLR parsing technology, which simply admits that the parse may be locally ambiguous and captures the multiple parses without comment or significant overhead. GLR parsers are linear time where there are no local ambiguities. They are more expensive in the ambiguity region, but as a practical matter, most of the source text in a standard C++ program falls into the "linear time" part. So the effective rate is linear, even capturing the ambiguities. Both of the implemented parsers do name and type resolution after parsing and use inconsistencies to eliminate the incorrect ambiguous parses. (The two implementations are Elsa and our (SD's) C++ Front End. I can't speak for Elsa's current capability (I don't think it has been updated in years), but ours does all of C++11 [EDIT Jan 2015: now full C++14. EDIT Oct 2017: now full C++17], including GCC and Microsoft variants.)

If you take the hard computer science definition that a language is extensionally defined as an arbitrary set of strings (grammars are supposed to be succinct ways to encode that intensionally) and treat parsing as "check that the syntax of the program is correct", then for C++ you have to expand the templates to verify that each one actually expands completely. There's a Turing machine hiding in the templates, so in theory checking that a C++ program is valid is undecidable (there is no bound on how long it can take). Real compilers (honoring the standard) place fixed constraints on how much template unfolding they'll do, and so does real memory, so in practice C++ compilers finish. But they can take arbitrarily long to "parse" a program in this sense. And I think that's the answer most people care about.

As a practical matter, I'd guess most templates are actually pretty simple, so C++ compilers can finish as fast as other compilers on average. Only people crazy enough to write Turing machines in templates pay a serious price. (Opinion: the price is really the conceptual cost of shoehorning complicated things onto templates, not the compiler execution cost.)

Hard to tell if C++ can be "just parsed", as - contrary to most languages - it cannot be analysed syntactically without performing semantic analysis at the same time.

Depends what you mean by "parsed", but if your parsing is supposed to include template instantiation, then not in general:

[Shortcut if you want to avoid reading the example - templates provide a rich enough computational model that instantiating them is, in general, a halting-style problem]

template<int N>
struct Triangle {
    static const int value = Triangle<N-1>::value + N;
};

template<>
struct Triangle<0> {
    static const int value = 0;
};

int main() {
    return Triangle<127>::value;
}

Obviously, in this case the compiler could theoretically spot that triangle numbers have a simple generator function, and calculate the return value using that. But otherwise, instantiating Triangle<k> is going to take O(k) time, and clearly k can go up pretty quickly with the size of this program, as far as the limit of the int type.
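As a rough illustration of that shortcut, the sketch below puts the recursive template next to the closed form n(n+1)/2. The function name triangle_closed_form is my own, and I'm assuming C++11 for constexpr and static_assert. The template version makes the compiler instantiate Triangle<127> down to Triangle<0>, while the closed form is a single arithmetic expression:

```cpp
#include <cassert>

// The recursive definition from the example above: instantiating
// Triangle<k> forces Triangle<k-1>, ..., Triangle<0> to be instantiated.
template<int N>
struct Triangle {
    static const int value = Triangle<N - 1>::value + N;
};

template<>
struct Triangle<0> {
    static const int value = 0;
};

// Hypothetical closed-form equivalent: constant work regardless of n
// (ignoring overflow for large n).
constexpr int triangle_closed_form(int n) {
    return n * (n + 1) / 2;
}

// Both routes agree at compile time.
static_assert(Triangle<127>::value == triangle_closed_form(127),
              "recursive and closed-form results should match");
```

A compiler is not required to discover the closed form on its own, and for arbitrary template recursion there is no such shortcut to find.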

[End of shortcut]

Now, in order to know whether Triangle<127>::value is an object or a type, the compiler in effect must instantiate Triangle<127> (again, maybe in this case it could take a shortcut since value is defined as an object in every template specialization, but not in general). Whether a symbol represents an object or a type is relevant to the grammar of C++, so I would probably argue that "parsing" C++ does require template instantiation. Other definitions might vary, though.
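Here is a small sketch of why the shortcut fails in general. The template Pick and both function names are invented for illustration. One specialization makes `value` an object and another makes it a type, so the statement `Pick<N>::value * x;` changes grammatical category depending on which specialization gets selected:

```cpp
#include <cassert>

// Primary template: `value` is a static data member (an object).
template<int N>
struct Pick {
    static const int value = N;
};

// Explicit specialization: `value` is a type instead.
template<>
struct Pick<0> {
    typedef int value;
};

// Pick<3>::value is an object, so `*` is multiplication.
inline int value_is_object() {
    int x = 5;
    return Pick<3>::value * x;   // 3 * 5
}

// Pick<0>::value is a type, so the same shape of statement
// declares `x` as a pointer to int.
inline bool value_is_type() {
    Pick<0>::value * x;          // declaration: int* x;
    x = nullptr;
    return x == nullptr;
}
```

The compiler can only tell these apart by choosing the specialization, which for a computed argument means doing the template computation first.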

Actual implementations arbitrarily cap the depth of template instantiation, making a big-O analysis irrelevant, but I ignore that since in any case actual implementations have natural resource limits, also making big-O analysis irrelevant...

I expect you can produce similarly-difficult programs in C with recursive #include, although I'm not sure whether you intend to include the preprocessor as part of the parsing step.

Aside from that, C, in common with plenty of other languages, can have O(not very much) parsing. You may need symbol lookup and so on, which as David says in his answer cannot in general have a strict O(1) worst case bound, so O(n) might be a bit optimistic. Even an assembler might look up symbols for labels. But for example dynamic languages don't necessarily even need symbol lookup for parsing, since that might be done at runtime. If you pick a language where all the parser needs to do is establish which symbols are keywords, and do some kind of bracket-matching, then the Shunting Yard algorithm is O(n), so there's hope.
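For reference, a minimal sketch of the Shunting Yard idea, restricted (my assumption, to keep it short) to single-character alphanumeric operands, the four left-associative operators + - * / and parentheses. The function names are made up. Every token is pushed onto and popped off the operator stack at most once, which is where the O(n) bound comes from:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Precedence of the supported binary operators; 0 means "not an operator"
// (this also makes '(' act as a barrier on the operator stack).
inline int precedence(char op) {
    switch (op) {
        case '+': case '-': return 1;
        case '*': case '/': return 2;
        default:            return 0;
    }
}

// Converts an infix expression to postfix (RPN).
// Returns an empty string on mismatched parentheses or unknown characters
// (so an empty input is indistinguishable from an error in this sketch).
inline std::string to_postfix(const std::string& infix) {
    std::string output;
    std::vector<char> ops;  // operator stack

    for (char c : infix) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc)) continue;

        if (std::isalnum(uc)) {
            output += c;                      // operands go straight to output
        } else if (c == '(') {
            ops.push_back(c);
        } else if (c == ')') {
            // Pop back to the matching '('.
            while (!ops.empty() && ops.back() != '(') {
                output += ops.back();
                ops.pop_back();
            }
            if (ops.empty()) return "";       // unmatched ')'
            ops.pop_back();                   // discard the '('
        } else if (precedence(c) > 0) {
            // Left-associative: pop operators of equal or higher precedence.
            while (!ops.empty() && precedence(ops.back()) >= precedence(c)) {
                output += ops.back();
                ops.pop_back();
            }
            ops.push_back(c);
        } else {
            return "";                        // unsupported character
        }
    }

    // Flush the remaining operators.
    while (!ops.empty()) {
        if (ops.back() == '(') return "";     // unmatched '('
        output += ops.back();
        ops.pop_back();
    }
    return output;
}
```

For example, `to_postfix("a+b*c")` gives `abc*+` and `to_postfix("(a+b)*c")` gives `ab+c*`. Note that nothing here consults a symbol table, which is exactly the property C++ lacks.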

Source: https://stackoverflow.com/questions/4172342/complexity-of-parsing-c

Tags

c++

parsing

theory

big-o

compiler-theory