Are there any tools that can randomly generate source code according to a language grammar?

Question

C source code can be parsed according to the C grammar (described as a CFG) and eventually turned into ASTs. I am wondering whether a tool exists that does the reverse: first it randomly generates ASTs according to the CFG, in which the tokens carry only their types and no concrete string values; then it generates the concrete tokens according to the tokens' regular-expression definitions.

I imagine the first step as an iterative replacement of non-terminals, done at random and bounded by some maximum number of iterations. The second step is just generating random strings that match the regular expressions.

Is there any tool that can do this?

Answer 1:

The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.

In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
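As a minimal sketch of that idea, here is what a generator shaped like a recursive descent parser might look like for a small arithmetic grammar. The grammar, the function names, the continuation probabilities and the `depth` cut-off are all assumptions made for illustration; they do not come from DGL or any other tool.

```python
import random

# Hypothetical toy grammar, one function per nonterminal (as in a
# recursive descent parser, but emitting text instead of consuming it):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

def gen_expr(rng, depth):
    out = gen_term(rng, depth)
    # Where a parser would loop while it sees '+' or '-', flip a coin.
    while rng.random() < 0.3:
        out += " " + rng.choice("+-") + " " + gen_term(rng, depth)
    return out

def gen_term(rng, depth):
    out = gen_factor(rng, depth)
    while rng.random() < 0.3:
        out += " " + rng.choice("*/") + " " + gen_factor(rng, depth)
    return out

def gen_factor(rng, depth):
    # Only recurse while depth remains, so nesting is bounded.
    if depth > 0 and rng.random() < 0.4:
        return "(" + gen_expr(rng, depth - 1) + ")"
    return str(rng.randint(0, 99))

if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(gen_expr(rng, 3))
```

Each choice point the parser would resolve by looking at the next token is resolved here by a random draw, which is also where per-production weights could be plugged in.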

Answer 2:

Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.

For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.

The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.

Answer 3:

If you have a model of the grammar in a normalized form (all rules like this):

 LHS = RHS1 RHS2 ...  RHSn ;

and a language prettyprinter (e.g., an AST-to-text conversion tool), you can build one of these pretty easily.

Simply start with the goal symbol as a unit tree.

  Repeat until no nonterminals are left:
    Pick a nonterminal N in the tree;
       Expand by adding children for the right hand side of any rule
       whose left-hand side matches the nonterminal N

For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
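A rough sketch of generating that random content, assuming three hypothetical token types named `IDENT`, `NUM` and `STRING`. The lexical shapes used here are simplified stand-ins, not the token definitions of C or any other real language.

```python
import random
import string

def random_identifier(rng, max_len=8):
    # A letter or underscore, followed by letters, digits or underscores.
    first = rng.choice(string.ascii_letters + "_")
    tail_chars = string.ascii_letters + string.digits + "_"
    tail = "".join(rng.choice(tail_chars) for _ in range(rng.randint(0, max_len - 1)))
    return first + tail

def random_integer(rng):
    # Decimal literal without leading zeros.
    return str(rng.randint(0, 10**6))

def random_string(rng, max_len=10):
    # Double-quoted string restricted to characters that need no escaping.
    body = "".join(rng.choice(string.ascii_lowercase + " ") for _ in range(rng.randint(0, max_len)))
    return '"' + body + '"'

# Hypothetical mapping from token type to generator.
TOKEN_GENERATORS = {
    "IDENT": random_identifier,
    "NUM": random_integer,
    "STRING": random_string,
}

def fill_terminals(tokens, rng):
    """Replace value-carrying token types with random text; keep the rest as-is."""
    return [TOKEN_GENERATORS[t](rng) if t in TOKEN_GENERATORS else t for t in tokens]

if __name__ == "__main__":
    print(" ".join(fill_terminals(["IDENT", "=", "NUM", ";"], random.Random())))
```

A real generator would also have to reject identifiers that collide with reserved words, which this sketch does not attempt.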

A complication with this expansion algorithm is that it is not guaranteed to terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until all the nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.

Then prettyprint the result. It's the prettyprinter part that is hard to get right.
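The bounded expansion can be sketched roughly as follows. Several things here are assumptions of the sketch rather than part of the answer: the toy grammar, the use of a flat list of symbols in place of a real tree and prettyprinter, and the look-ahead that rejects a rule when even the cheapest completion would exceed the budget (added so the search does not thrash; plain backtracking would work without it, only more slowly).

```python
import random

# Hypothetical toy grammar: keys are nonterminals, everything else is a terminal.
GRAMMAR = {
    "expr": [["term"], ["expr", "+", "term"]],
    "term": [["factor"], ["term", "*", "factor"]],
    "factor": [["NUM"], ["(", "expr", ")"]],
}

def min_costs(grammar):
    """Fewest rule applications needed to rewrite each nonterminal into terminals."""
    cost = {nt: float("inf") for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                c = 1 + sum(cost.get(s, 0) for s in rhs)
                if c < cost[nt]:
                    cost[nt], changed = c, True
    return cost

MIN_COST = min_costs(GRAMMAR)

def expand(form, budget, rng):
    """Rewrite `form` into terminals using at most `budget` rule applications.

    Returns the list of terminal symbols, or None if no derivation fits.
    """
    positions = [k for k, s in enumerate(form) if s in GRAMMAR]
    if not positions:
        return form  # no nonterminals left
    i = rng.choice(positions)  # pick a nonterminal N
    rules = GRAMMAR[form[i]][:]
    rng.shuffle(rules)
    for rhs in rules:
        new_form = form[:i] + rhs + form[i + 1:]
        # Look-ahead: skip rules whose cheapest completion cannot fit.
        if sum(MIN_COST.get(s, 0) for s in new_form) > budget - 1:
            continue
        result = expand(new_form, budget - 1, rng)
        if result is not None:
            return result
        # Otherwise undo this replacement and try the next rule.
    return None

if __name__ == "__main__":
    print(" ".join(expand(["expr"], 40, random.Random())))
```

The budget counts rule applications, which for a grammar in this normalized form is the same as the number of interior nodes in the resulting tree.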

[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].

A nasty problem even with well-formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex; consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.

Answer 4:

The problem with random generation is that for many CFGs, the expected length of the output string is infinite. (The expected length is easy to compute using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar.) You have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, weighting each production rule for a non-terminal symbol inversely to the length of its RHS sometimes suffices.
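To make this concrete, here is a small sketch that estimates the expected output length of a probabilistic grammar by iterating the equations E[A] = sum over rules of p(rule) * (number of terminals in the RHS + sum of E[B] for each nonterminal B in the RHS), starting from zero. The grammar `S -> S S | a`, the helper names and the iteration limits are assumptions chosen for illustration; treating "did not converge" as "infinite" is a heuristic, not a proof.

```python
def with_probabilities(rules, weight):
    """Turn a list of right-hand sides into (probability, rhs) pairs using weight(rhs)."""
    weights = [weight(rhs) for rhs in rules]
    total = sum(weights)
    return [(w / total, rhs) for w, rhs in zip(weights, rules)]

def expected_length(grammar, start, max_iter=10_000, tol=1e-12):
    """Estimate the expected number of terminals derived from `start`.

    Returns float('inf') when the iteration blows up or fails to settle.
    """
    exp = {nt: 0.0 for nt in grammar}
    for _ in range(max_iter):
        new = {
            nt: sum(
                p * sum(exp[s] if s in grammar else 1.0 for s in rhs)
                for p, rhs in rules
            )
            for nt, rules in grammar.items()
        }
        if new[start] > 1e12:
            return float("inf")
        if all(abs(new[nt] - exp[nt]) < tol for nt in grammar):
            return new[start]
        exp = new
    return float("inf")  # heuristic: no convergence is reported as divergence

RULES = [["S", "S"], ["a"]]
# Both rules equally likely: p(S -> S S) = 1/2.
uniform = {"S": with_probabilities(RULES, lambda rhs: 1.0)}
# Weights inversely proportional to RHS length: p(S -> S S) = 1/3.
inverse = {"S": with_probabilities(RULES, lambda rhs: 1.0 / len(rhs))}

if __name__ == "__main__":
    print(expected_length(uniform, "S"))  # inf
    print(expected_length(inverse, "S"))  # approximately 2.0
```

For this grammar the equation is E = 2pE + (1 - p), so E = (1 - p) / (1 - 2p) when p < 1/2 and the expectation is infinite otherwise; the inverse-length weighting gives p = 1/3 and E = 2.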

There is a lot more on this subject in: Noam Chomsky and Marcel-Paul Schützenberger, "The Algebraic Theory of Context-Free Languages", pp. 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963). See also the Wikipedia entry on the Chomsky–Schützenberger enumeration theorem.

Source: https://stackoverflow.com/questions/4468086/any-tools-can-randomly-generate-the-source-code-according-to-a-language-grammar

Tags

random

compiler-construction

context-free-grammar