How can one create a data flow graph (DFG/SDFG) for any application from its source code

狂风中的少年 提交于 2019-12-01 20:18:07
Ira Baxter

You want to extract data flow graphs from arbitrary languages. You imply you want a single way to do it. This isn't practical by hand... you need a tool.

Such a tool is singularly hard to build.

To do this, for each language you must be able to:

  • Define the language to the tool, in the form you find it in practice (not just the language reference manual version). C++ in the wild is bent in a lot of funny ways compared to the standard.
  • Parse programs in the language as found in the field, perhaps as one file, perhaps as tens of thousands; some programs aren't small.
  • Build structures representing the language elements and their relations to one another (this is often done as an abstract syntax tree)
  • Determine for each literal what its actual value is; "a\xbc" has very different values depending on whether the language thinks it is ascii or unicode text with escape sequences
  • Find all the identifiers in the code, and determine for each one which definitional/type information is associated with it according the language scoping rules
  • Determine the sources of data (literal values, inputs from the outside world, results of expressions) and track where those data values are used in other parts of the program across various control flow constructs
  • Presumably draw some picture of the resulting data flow.

Each of these tasks by themselves are hard, because the languages tend to be complex. Most language tools that can do this at all (mostly compilers) do it only for one dialect of the language.

To do it for more than one language/dialect, you need a tool that can be configured for all the details for each language, and you have to configure for all the languages of interest. [Realistically you cannot "do them all"; there's thousands of computer languages in use right now].

Even limiting yourself to the "everyday" common programming languages, this is an enormous amount of work; it can take a few years to do all this well for a single mainstream language. You won't succeed in doing this by yourself.

My company builds a single, unified tool that is intended to be capable of doing this: the DMS Software Reengineering Toolkit. The simple "secret" is to realize that the machinery needed to accomplish the above tasks is actually very similar across languages, and can be designed to configured for a particular language with relatively modest (that doesn't mean "small") effort.

After 20 linear years of engineering with a team of PhD level engineers, we have parsers (even this is hard) for a surprising variety of languages, with full up data flow analyzers of the type you are talking about for C++ (check this link for examples), C, COBOL and almost Java 8.

I don't know of any other unified tools that are this far down the path toward your ideal. Check my bio before you decide I'm clueless about this. (Rascal/MPL has some ambitions but is a research tool at this point; they don't do C or C++ at all) We're only part way there, with many languages and scale battles to fight remaining.

[The goal of DMS isn't data flow analysis; that's just a stepping stone. It is to do automated code transformation, which requires data flow analysis to do safely and correctly].

Of course, you could just hope to find a separate tool for each language. You wouldn't get consistent quality or consistent style/granularity of data flow graphs from separate tools from different authors, if you can indeed get a full set of such tools at all.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!