Instrumenting C/C++ codes using LLVM

前端 未结 2 1842
执笔经年
执笔经年 2021-01-30 02:43

I just read about the LLVM project and that it could be used to do static analysis on C/C++ codes using the analyzer Clang which the front end of LLVM. I wanted to know if it is

2条回答
  •  时光说笑
    2021-01-30 03:05

    First off, you have to decide whether you want to work with clang or LLVM. They both operate on very different data structures which have advantages and disadvantages.

    From your sparse description of your problem, I'll recommend going for optimization passes in LLVM. Working with the IR will make it much easier to sanitize, analyze and inject code because that's what it was designed to do. The downside is that your project will be dependent on LLVM which may or may not be a problem for you. You could output the result using the C backend but that won't be usable by a human.

    Another important downside when working with optimization passes is that you also lose all symbols from the original source code. Even if the Value class (more on that later) has a getName method, you should never rely on it to contain anything meaningful. It's meant to help you debug your passes and nothing else.

    You will also have to have a basic understanding of compilers. For example, it's a bit of a requirement to know about basic blocks and static single assignment form. Fortunately they're not very difficult concepts to learn or understand (the Wikipedia articles should be adequate).

    Before you can start coding, you first have to do some reading so here's a few links to get you started:

    • Architecture Overview: A quick architectural overview of LLVM. Will give you a good idea of what you're working with and whether LLVM is the right tool for you.

    • Documentation Head: Where you can find all the links below and more. Refer to this if I missed anything.

    • LLVM's IR reference: This is the full description of the LLVM IR which is what you'll be manipulating. The language is relatively simple so there isn't too much to learn.

    • Programmer's manual: A quick overview of basic stuff you'll need to know when working with LLVM.

    • Writting Passes: Everything you need to know to write transformation or analysis passes.

    • LLVM Passes: A comprehensive list of all the passes provided by LLVM that you can and should use. These can really help clean up the code and make it easier to analyze. For example, when working with loops, the lcssa, simplify-loop and indvar passes will save your life.

    • Value Inheritance Tree: This is the doxygen page for the Value class. The important bit here is the inheritance tree that you can follow to get the documentation for all the instructions defined in the IR reference page. Just ignore the ungodly monstrosity that they call the collaboration diagram.

    • Type Inheritance Tree: Same as above but for types.

    Once you understand all that then it's cake. To find memory accesses? Search for store and load instructions. To instrument? Just create what you need using the proper subclass of the Value class and insert it before or after the store and load instruction. Because your question is a bit too broad, I can't really help you more than this. (See correction below)

    By the way, I had to do something similar a few weeks ago. In about 2-3 weeks I was able to learn all I needed about LLVM, create an analysis pass to find memory accesses (and more) within a loop and instrument them with a transformation pass I created. There was no fancy algorithms involved (except the ones provided by LLVM) and everything was pretty straightforward. Moral of the story is that LLVM is easy to learn and work with.


    Correction: I made an error when I said that all you have to do is search for load and store instructions.

    The load and store instruction will only give accesses that are made to the heap using pointers. In order to get all memory accesses you also have to look at the values which can represent a memory location on the stack. Whether the value is written to the stack or stored in a register is determined during the register allocation phase which occurs in an optimization pass of the backend. Meaning that it's platform dependent and shouldn't be relied on.

    Now unless you provide more information about what kind of memory accesses you're looking for, in what context and how you intend to instrument them, I can't help you much more then this.

提交回复
热议问题