Any differences between terms parse trees and derivation trees?

问题

The terms AST (Abstract Syntax Tree), parse tree and derivation tree are bandied about by different people when referring to the result of parsing texts conforming to a grammar. Assuming we are talking about parsing computer languages, are their differences minute enough that we can use these terms interchangeably ? If not, how do we use the terms correctly ?

回答1:

AFAIK, "derivation tree" and "parse tree" are the same.

Abstract syntax tree

In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is 'abstract' in the sense that it does not represent every detail that appears in the real syntax.

Parse tree

A concrete syntax tree or parse tree or parsing tree is an (ordered, rooted) tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the grammar.

Take the source a = (1 + 2) * 3; for example. The parse tree could look like:

    ASSIGNMENT
   / / |      \
  / /  |       \ 
 a = expression ;
       /   \
 expression \ 
   / | \     \
  (  +  )     *
    / \        \
   1   2        3

while the abstract syntax tree could look like:

ASSIGNMENT
  /    \
 a   expression 
      /     \
 expression  *
     |        \ 
     +         3 
    / \
   1   2

回答2:

I would use the term parse tree when the tree is produced by parsing, that is when evaluating if a given input sequence belongs to the language and determining which productions must be used in which order to regenerate the sequence.

A derivation tree would have exactly the same shape, but would be produced by the process of deriving a sequence from a given production.

The formal definition of parsing is finding a derivation for the given input sequence, so no wonder derivation and parse trees are the same.

Concrete versus abstract syntax trees differ in that the former has a leaf node for each token in the input sequence, while the latter omits any tokens that can be known by inspecting the grammar. A concrete syntax subtree for if <expr> then <statement> else <statement> end would have leafs for if, then, else, and end, and the abstract one would not. The concrete syntax tree for (2+3) would be:

  e
  |
( e )
 /|\        
| | |  
n + n

The abstract one would be just:

  +
 | |  
 n n

回答3:

Parse/Derivation/Concrete syntax trees are all synonyms for the same concept.

Such trees are normally only used in theory discussions, because they contains lots of details that seem unnecessary for doing langauge processing; in an expression tree, do you really need a node to represent "(" and another to represent ")"?

The notion of an "abstract syntax" tree is one which represents the program structure to a level of detail that is adequate for processing in practice; you don't typically find nodes for "(...)".

An interesting question: is an AST directly computable from a CST? The answer is should be yes, but people hardly ever do it. What they tend to do is construct "abstract syntax" nodes as the parser runs, and use ad hoc (rule reduction procedural attachment) to assemble the nodes from child parses with a glue node for the parent. IMHO, they do this because we've all been brought up on YACC, and that's how it is traditionally done. (We used to light fires with flint, too.) There's a lesser excuse; doing it this way gives the compiler-builder complete control of the structure of the AST and he can produce one that is pretty minimal in terms of extra detail. Such an ad-hoc tree is not computable from the CST, except by the same ad-hoc computation that is embedded in the parser actions.

I've used a different approach: my tools compute AST directly from CSTs, literally by dropping irrelevant details, e.g., leaving out nodes that represent non-value bearing tokens (e.g., those pointless '(' ')' tokens as well as keywords), compressing out strings of unary productions, and converting right- or left-leaning trees equivalent to lists, into real list nodes. The advantage to doing this is the parser can compute the AST directly from the grammar rules. No fiddling around with procedural attachments. No getting it wrong. No more worrying about the fact that our COBOL grammar has 3500 rules and I would otherwise need procedural goo for every one of them, and that I have to change my grammar hundreds of times to get it right and fiddle with the goo every time. And our tools work as though they operate directly on the CST, which makes it easy to think about tree manipulations, especially if you are staring directly at the grammar rules. (This also makes pattern matching using surface syntax much easier: for any pattern fragment, there's a directly computable AST that corresponds).

So the distinction between AST and CST is real in terms of utility. But I think they should be considered as just isomorphic representations.

来源：https://stackoverflow.com/questions/5729961/any-differences-between-terms-parse-trees-and-derivation-trees

标签

parsing

grammar

lex

abstract-syntax-tree