Decompile C code with debug info?

Question

Java and Python bytecode are considerably easier to decompile than the machine code generated by a C/C++ compiler.

I am unable to find a convincing answer as to why the information from the -g option is insufficient for decompilation but sufficient for debugging. What extra information does Python/Java bytecode contain that makes decompilation easy?

Answer 1:

"I am unable to find a convincing answer as to why the information from the -g option is insufficient for decompilation but sufficient for debugging."

The debugging information basically contains only a mapping between addresses in the generated code and line numbers in the source files. The debugger does not need to decompile the code; it just shows you the original sources. If the source files are missing, the debugger won't magically show them.
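To make the address-to-line mapping concrete, here is a minimal Python sketch of the kind of lookup a debugger performs. The table contents (addresses, file names, line numbers) and the helper name addr_to_line are made up for illustration; real debug formats store this table in a much more compact encoding.

```python
import bisect

# Hypothetical line table: (start address, source file, line number),
# sorted by address. All values are invented for this example.
LINE_TABLE = [
    (0x401000, "tree.c", 10),
    (0x401004, "tree.c", 11),
    (0x40100C, "tree.c", 13),
    (0x401020, "cache.c", 42),
]
ADDRESSES = [row[0] for row in LINE_TABLE]


def addr_to_line(addr):
    """Return (file, line) for the table row covering addr, or None if addr is below the table."""
    i = bisect.bisect_right(ADDRESSES, addr) - 1
    if i < 0:
        return None
    _, filename, line = LINE_TABLE[i]
    return filename, line


print(addr_to_line(0x401006))
```

Note that the lookup only yields a file name and a line number. Showing the text of that line still requires the source file itself, which is why this mapping helps debugging but does not reconstruct code.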

That said, the presence of debugging info does make decompilation easier. If the debug info includes function prototypes and the layout of the types used, the decompiler can use it to produce a much more precise decompilation. In many cases, however, the result will still differ from the original source.

For example, here's a function decompiled with the Hex-Rays decompiler without using the debug info:

int __stdcall sub_4050A0(int a1)
{
  int result; // eax@1

  result = a1;
  if ( *(_BYTE *)(a1 + 12) )
  {
    result = sub_404600(*(_DWORD *)a1);
    *(_BYTE *)(a1 + 12) = 0;
  }
  return result;
}

Since it does not know the type of a1, the accesses to its fields are represented as additions and casts.

And here's the same function after the symbol file has been loaded:

void __thiscall mytree::write_page(mytree *this, PAGE *src)
{
  if ( src->isChanged )
  {
    cache::set_changed(this->cache, src->baseAddr);
    src->isChanged = 0;
  }
}

You can see that it's been improved quite a lot.
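The difference between the two listings comes down to knowing the layout of the structure. A rough Python sketch using ctypes shows the same memory read both ways. The PAGE layout here is an assumption: the listings only imply a 4-byte baseAddr at offset 0 and a 1-byte isChanged at offset 12, and the two middle fields are placeholders.

```python
import ctypes
import sys


class PAGE(ctypes.Structure):
    # Assumed layout; unknown1 and unknown2 are placeholders for
    # whatever really occupies offsets 4..11.
    _fields_ = [
        ("baseAddr", ctypes.c_uint32),   # offset 0
        ("unknown1", ctypes.c_uint32),   # offset 4
        ("unknown2", ctypes.c_uint32),   # offset 8
        ("isChanged", ctypes.c_uint8),   # offset 12
    ]


page = PAGE(baseAddr=0x1000, isChanged=1)
raw = bytes(page)  # the same object viewed as untyped memory

# Without type information: *(_BYTE *)(a1 + 12) and *(_DWORD *)a1
flag_untyped = raw[12]
base_untyped = int.from_bytes(raw[0:4], sys.byteorder)

# With type information: src->isChanged and src->baseAddr
print(flag_untyped == page.isChanged, base_untyped == page.baseAddr)
```

Both views read the same bytes. The typed view is only available when something, such as a symbol file, supplies the field names and offsets.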

As for why decompiling bytecode is usually easier, in addition to NPE's answer, see also this.

Answer 2:

Here are some of the reasons for this:

Java and Python bytecodes are relatively simple and high-level, whereas the instruction set of some CPUs (think x86) is fiendishly complicated.
The bytecodes closely mimic the structure of the language for which they've been designed.
When generating bytecodes, Java and Python compilers perform very little optimization. This results in bytecodes that closely correspond to the structure of the original source code. A good optimizing C or C++ compiler is capable of producing assembly that's far removed from the original source code.
There are few Java and Python compilers, and many C and C++ compilers. It's easier to produce a high-quality decompiler if you are targeting a single known compiler (or a small set of known compilers).
Python and Java are relatively simple languages compared to C++ (this point doesn't apply to C).
C++ templates present many challenges to quality decompilation (this point also doesn't apply to C).
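The C/C++ preprocessor transforms the source text before the compiler sees it, so the compiled code reflects the expanded text rather than what the author wrote.
In Python, there is a one-to-one relationship between source files and bytecode files. In Java, the relationship is one source file to one or more bytecode files. In C and C++, the relationship is many-to-many, with a lot of overlap on the source front (think headers).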
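On the Python side, the points above can be checked directly, since a compiled code object keeps much of what a decompiler needs. The sketch below compiles a small made-up function (write_page and its attribute names are invented, loosely mirroring the C++ listing) and inspects what survives in the bytecode.

```python
import dis
import types

# Made-up source, used only to have something to compile.
SOURCE = '''
def write_page(page, cache):
    if page.is_changed:
        cache.set_changed(page.base_addr)
        page.is_changed = False
'''

module_code = compile(SOURCE, "<example>", "exec")

# The function body is stored as a nested code object among the
# module's constants.
func_code = next(
    c for c in module_code.co_consts if isinstance(c, types.CodeType)
)

print(func_code.co_name)      # function name
print(func_code.co_varnames)  # parameter and local variable names
print(func_code.co_names)     # attribute and global names used
dis.dis(func_code)            # instructions; exact output varies by Python version
```

Function, parameter and attribute names are all kept in the compiled form, and the instructions operate on named attributes rather than on raw offsets. Stripped machine code typically has none of this.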

Answer 3:

Some processors, like x86 ones, have instructions of variable length. If control is passed into the middle (= anywhere after the first byte) of an instruction, that can be a valid instruction (or several instructions) too. This makes it hard to unambiguously disassemble machine code. C/C++ code can exploit this feature.
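A small illustration of this ambiguity, as a sketch: the toy decoder below understands only four 32-bit x86 opcodes (nowhere near a real disassembler), and the five-byte sequence was chosen by hand for the example. Decoding the same bytes from offset 0 and from offset 1 gives two different, equally valid instruction streams.

```python
# Hand-picked example bytes (32-bit x86).
CODE = bytes([0xB8, 0x31, 0xC0, 0xC3, 0x90])


def decode(code, start):
    """Linear decode from start, handling only the opcodes used in this example."""
    out = []
    i = start
    while i < len(code):
        op = code[i]
        if op == 0xB8 and i + 5 <= len(code):
            # mov eax, imm32: opcode followed by a 4-byte little-endian immediate
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov eax, 0x{imm:08x}")
            i += 5
        elif op == 0x31 and code[i + 1:i + 2] == b"\xC0":
            # xor r/m32, r32 with ModRM 0xC0 selects eax, eax
            out.append("xor eax, eax")
            i += 2
        elif op == 0xC3:
            out.append("ret")
            i += 1
        elif op == 0x90:
            out.append("nop")
            i += 1
        else:
            # Anything else is outside this toy decoder's scope.
            out.append(f"db 0x{op:02x}")
            i += 1
    return out


print(decode(CODE, 0))
print(decode(CODE, 1))
```

From offset 0 the bytes form a single mov with a 32-bit immediate. From offset 1 the immediate's bytes decode as xor, ret and nop. A disassembler cannot know which reading is intended without knowing where control actually enters.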

On some processors and OSes it is possible to execute data as if it were code and use code as if it were data. This makes it hard to unambiguously separate the two. And, again, this is what C/C++ programs can often do easily.

On some processors and OSes it's easy to generate code on the fly and execute it and it's possible to modify the existing code at run time. This too contributes to ambiguities in decompiling code. And C/C++ programs can often do this as well.

EDIT: Also, some CPUs have multiple encodings for the same instruction. For example, x86 CPUs have two forms of mov: mov reg, reg/mem and mov reg/mem, reg. These let you move data between a register and a memory location (in either direction) and between two registers. Either form can transfer data between two registers, but the two have different encodings. If the program somehow relies on a particular encoding (e.g. to validate its integrity via checksums), then from a disassembly line such as mov eax, ebx you cannot tell which of the two encodings was originally used, so reassembling the disassembly may break the program.
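As a concrete instance, in 32-bit x86 the byte pairs 89 D8 and 8B C3 both encode mov eax, ebx. The sketch below decodes only the register-to-register form of these two opcodes; the function name decode_mov_reg_reg is invented for the example.

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_reg_reg(encoding):
    """Decode a 2-byte register-to-register mov (opcode 0x89 or 0x8B)."""
    opcode, modrm = encoding
    mod = modrm >> 6
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    if mod != 0b11:
        raise ValueError("only the register-to-register form is handled")
    if opcode == 0x89:      # mov r/m32, r32: destination is the rm field
        dst, src = rm, reg
    elif opcode == 0x8B:    # mov r32, r/m32: destination is the reg field
        dst, src = reg, rm
    else:
        raise ValueError("not one of the two mov opcodes handled here")
    return f"mov {REGS[dst]}, {REGS[src]}"


form_a = bytes([0x89, 0xD8])
form_b = bytes([0x8B, 0xC3])
print(decode_mov_reg_reg(form_a), "|", decode_mov_reg_reg(form_b))
```

The two byte sequences differ but produce identical disassembly text, so an assembler fed that text has to pick one of them, and a checksum over the code bytes may no longer match.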

You can use the debugger to debug a program with or without debug/symbol information. This information only makes it easier for the human to navigate the code and data since many (but not necessarily all) routines and variables can be identified and shown using their names and types and not just raw addresses and raw typeless data.

I'm guessing that the various bytecodes are less ambiguous and more restricted in what they can do and that's what makes it easier to decompile those.

Source: https://stackoverflow.com/questions/15609440/decompile-c-code-with-debug-info

Tags

java

c++

reverse-engineering

decompiling