How to build an AST for a proprietary language?

Question

I'm trying to understand how to build an AST for a proprietary language. I need the AST so I can apply my rules and guidelines to check the source code for possible errors.

How does one go about building an AST? Are there any books or articles that may help me get started? Will the Dragon Book on compilers help?

Please note I'm from a non-CS background.

Thanks

Answer 1:

This is a pretty large question. I do feel your pain, as I also tackled this problem from a non-CS background. It kind of made me appreciate CS a lot more.

One thing you will probably see a lot of in your research is Extended Backus-Naur Form (EBNF). It's basically a notation for describing the grammar of your language. Writing an EBNF grammar for your language will help you wrap your head around what the computer needs to do to parse it.

Getting back to the problem at hand: You will probably be using a lexer/parser to build your tree.
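To make the lexer/parser idea concrete, here is a minimal hand-written sketch in plain Python (no parser generator). The toy arithmetic grammar, the node classes `Num`, `Name`, `BinOp`, and the function names are all invented for illustration; your proprietary language will need its own grammar and node types.

```python
import re
from dataclasses import dataclass

# Hypothetical toy grammar (EBNF), invented for this sketch:
#   expr   = term   { ("+" | "-") term } ;
#   term   = factor { ("*" | "/") factor } ;
#   factor = NUMBER | IDENT | "(" expr ")" ;

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

@dataclass
class Num:
    value: int

@dataclass
class Name:
    ident: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def tokenize(text):
    """Lexer: turn the source string into (kind, value) pairs."""
    tokens = []
    for number, ident, other in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUMBER", number))
        elif ident:
            tokens.append(("IDENT", ident))
        elif other.strip():          # skip stray whitespace
            tokens.append(("OP", other))
    tokens.append(("EOF", ""))
    return tokens

class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = BinOp(op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = BinOp(op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.advance()
        if kind == "NUMBER":
            return Num(int(value))
        if kind == "IDENT":
            return Name(value)
        if (kind, value) == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

def parse(text):
    """Parse a whole expression and return the root AST node."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek()[0] != "EOF":
        raise SyntaxError(f"unexpected token {parser.peek()[1]!r}")
    return tree
```

Calling `parse("1 + 2 * x")` returns a `BinOp` for `+` whose right child is the `BinOp` for `*`, so operator precedence falls out of how the grammar rules are layered. Tools like the ones below generate this kind of code from the grammar so you do not have to write it by hand.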

The traditional tools to use to do that are lex and yacc, or their somewhat more modern cousins flex and bison.

A newer approach is ANTLR. It comes highly recommended, but it was over my head.

A third approach I found is Python's pyparsing library. It's the one I ultimately went with due to my familiarity with Python and the readable way it describes what you need it to parse.

There are plenty of examples available for pyparsing, which helped. The one I found most helpful building my parser was SimpleCalc. However, it is based on a fairly old version of pyparsing, and it is more complex than it needs to be with some of the powerful operations that pyparsing later implemented. SimpleArith is a similar, but newer version.

One thing I haven't actually handled yet with pyparsing is properly analyzing syntax errors. It seems like it provides the necessary tools for you to do so, however.

Anyway, this isn't really a complete answer to your question, but I hope it at least points you at a few places to start. Building a parser for a complex language isn't easy!

Answer 2:

Code analysis engines generally require quite a lot of sophistication above and beyond just building ASTs.

To do any serious code analysis, you need to know the meaning of identifiers in code and where/how they are defined ("symbol tables"), and you often need to know how information travels around the program (control and data flow analysis). You need machinery to support all of these, and then you need to tie that machinery to your proprietary language.
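As a rough illustration of what a symbol table buys you, here is a small sketch that flags identifiers used before they are defined. The tuple-based program representation and the function names are hypothetical, made up for this example. It only handles straight-line code, so it is far simpler than real control and data flow analysis.

```python
# Hypothetical program representation: a list of statements, each a tuple.
#   ("assign", name, expr)   defines `name`
#   ("print", expr)          uses whatever names appear in `expr`
# Expressions are ints, names (str), or ("binop", op, left, right).

def names_used(expr):
    """Yield every identifier referenced inside an expression."""
    if isinstance(expr, str):
        yield expr
    elif isinstance(expr, tuple):
        _, _, left, right = expr
        yield from names_used(left)
        yield from names_used(right)

def check_undefined(program):
    """Walk statements in order, recording definitions in a symbol table."""
    symbols = {}      # name -> index of the statement that first defined it
    problems = []
    for index, stmt in enumerate(program):
        expr = stmt[-1]
        # Check uses first: the right-hand side is evaluated before the assignment.
        for name in names_used(expr):
            if name not in symbols:
                problems.append(f"statement {index}: '{name}' used before definition")
        if stmt[0] == "assign":
            symbols.setdefault(stmt[1], index)
    return problems

program = [
    ("assign", "x", 1),
    ("assign", "y", ("binop", "+", "x", "z")),   # 'z' was never defined
    ("print", "y"),
]

for problem in check_undefined(program):
    print(problem)
# prints: statement 1: 'z' used before definition
```

Once branches, loops, scopes, and function calls enter the picture, a single dictionary is no longer enough, which is where the extra machinery described above comes in.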

I think of climbing Everest as an analogy. Getting ASTs is like getting to the 10,000-foot base camp. Any clod can do that by just walking up the hill using basic technology (hiking boots). Climbing the last 17,000 feet requires a whole different kind of technology, commitment, and plan, and most folks, having walked up the first 10,000 feet, are simply unprepared for the rest of the trip. (I have some experience here; check my bio.)

These are all pretty detailed topics, and your lack of a CS background is likely to make the road pretty rough. (We all start somewhere, however, so this is really a matter of ambition.) The Dragon Book is an excellent resource that will help you understand what all this machinery does and why you need it; many other fine compiler books exist and will generally serve just as well. But you need to be prepared to do some serious reading.

One way to get up the curve is to use a tool in which much of this machinery has been already thought out and implemented for you, by a bunch of computer scientists experienced at building such tools. Then your problem is considerably reduced: you only need to learn how to use what they provided, rather than trying to figure out what you need (most folks never get past the AST stage) and implement all the necessary support machinery.

ANTLR (already mentioned, done by a pretty good CS professor) is sort of one, in that it provides parsing capabilities, enables you to define how ASTs are built, and how to climb over the resulting ASTs procedurally. But it doesn't provide much else you need for your task.
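To show what "climbing over an AST procedurally" looks like in practice, here is a sketch using Python's own standard-library `ast` module as a stand-in. This is not ANTLR and not your language: it walks Python source code, and the rule itself (flagging `== None` comparisons) and the class name are just examples I picked. The general shape, a visitor with one method per node type that records findings, is the same idea.

```python
import ast

class NoneComparisonRule(ast.NodeVisitor):
    """Example rule: flag `== None` / `!= None` comparisons."""

    def __init__(self):
        self.findings = []   # line numbers where the rule fired

    def visit_Compare(self, node):
        # A comparison chain like `a == b != c` has several operators;
        # pair each operator with the operands on either side of it.
        operands = [node.left, *node.comparators]
        for op, left, right in zip(node.ops, operands, operands[1:]):
            if isinstance(op, (ast.Eq, ast.NotEq)) and any(
                isinstance(side, ast.Constant) and side.value is None
                for side in (left, right)
            ):
                self.findings.append(node.lineno)
        self.generic_visit(node)   # keep descending into child nodes

source = "a = 1\nif a == None:\n    pass\nif a is None:\n    pass\n"
rule = NoneComparisonRule()
rule.visit(ast.parse(source))
print(rule.findings)   # prints: [2]
```

A checker for your guidelines would end up as a collection of small rule classes like this one, each run over the tree your parser produces.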

Our DMS Software Reengineering Toolkit provides all the facilities I mentioned above, and then some. One of the first differences you will notice working with DMS is that you only need to provide it with a grammar; it will build ASTs without any further help from you.

You can get a flavor of what working with DMS is like from this example of DMS applied to high school algebra and calculus. In particular, it shows how a simple grammar for algebra/calculus can be easily defined, and how "programs" in that language can then be manipulated. This application is one that "transforms" code rather than analyzes it, but the basics are the same.

A "real" DMS application that analyzes your proprietary language will be considerably more complex.

Source: https://stackoverflow.com/questions/4477400/how-to-build-a-ast-for-a-proprietary-language

Tags

static-analysis

abstract-syntax-tree