I am writing a tool which uses libbfd and libopcodes in x86-32 and x86-64 Linux to perform disassembly. The problem is that whilst I am able to get
libopcodes prints each disassembled instruction to a stream, which your custom_printf function intercepts. Your mistake is assuming that custom_printf is called once per disassembled instruction. It is actually called several times per instruction: once for each mnemonic, operand, address or separator.
So the resulting disassembly of your binary is:
xor %ebp, %ebp
mov %rdx, %r9
pop %rsi
mov %rsp, %rdx
and $0xfffffffffffffff0, %rsp
push %rax
push %rsp
mov $0x401450,%r8
...
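Since the callback fires once per fragment, one way to end up with one string per instruction is to append every fragment to a buffer and only look at it after the disassembler call returns. A minimal sketch, assuming the callback has the fprintf-like shape (stream pointer, format string, varargs) that libopcodes expects; insn_buf, insn_buf_reset and collect_fragment are made-up names and the 128-byte size is arbitrary:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-instruction buffer; name and size are illustrative. */
typedef struct {
    char   text[128];
    size_t len;
} insn_buf;

/* Clear the buffer before disassembling the next instruction. */
void insn_buf_reset(insn_buf *b)
{
    b->len = 0;
    b->text[0] = '\0';
}

/* fprintf-style callback: appends one fragment (mnemonic, operand,
   separator, ...) to the insn_buf passed as the stream argument. */
int collect_fragment(void *stream, const char *fmt, ...)
{
    insn_buf *b = stream;
    va_list ap;
    int n;

    assert(b != NULL && b->len < sizeof b->text);
    va_start(ap, fmt);
    n = vsnprintf(b->text + b->len, sizeof b->text - b->len, fmt, ap);
    va_end(ap);

    if (n < 0)
        return n;
    /* vsnprintf reports the untruncated length; clamp to what was stored. */
    if ((size_t)n >= sizeof b->text - b->len)
        n = (int)(sizeof b->text - b->len - 1);
    b->len += (size_t)n;
    return n;
}
```

The idea would be to pass an insn_buf as the stream, call insn_buf_reset before each disassembler call, and treat text as one complete instruction once that call returns.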
Unfortunately, as of binutils libopcodes 2.22, insn_type is not filled in on either i386 or x86_64. The only widely used architectures that support it are MIPS, Sparc, and the Cell's SPU. This is still true as of current CVS HEAD.
It's hard to prove that something does not exist, but in the Sparc disassembler source you can see several occurrences of insn_type being set (for instance info->insn_type = dis_branch), whereas in the i386 disassembler source there are no occurrences of insn_type, nor of any of the values it would be expected to have (dis_branch, dis_nonbranch, etc.).
Searching the libopcodes sources for files that support insn_type, you get:
opcodes/mips-dis.c
opcodes/spu-dis.c
opcodes/microblaze-dis.c
opcodes/cris-dis.c
opcodes/sparc-dis.c
opcodes/mmix-dis.c
Doing this with just those libraries is going to be an extremely painful and arduous process. I think you should listen to Necrolis and use a library that already does this. I've used Dyninst in the past (namely, the InstructionAPI + ParseAPI). They're very well documented, and will do exactly what you're trying to do. At the very least, spending an hour with this library and compiling the examples in their manuals will give you an application that lets you examine things like the opcode of each instruction, the length of each instruction, the number of arguments to each instruction, etc. These are things that libopcodes neither tells you nor handles (it decodes one address at a time, and those addresses aren't guaranteed to be instructions).
Here's a snippet from the developers of Opdis that I took from their manual (which I would suggest reading if you haven't, lots of good stuff in there about libopcodes):
The libopcodes library is a very serviceable disassembler, but it has three shortcomings:
- it is under-documented, making it difficult for new users to understand
- its feature set is limited to the disassembly of a single address
- it is designed mainly to print disassembled instructions to a stream
Among other things, I think you might be getting stung by the second item in that list. Most (all?) of the opcodes you are seeing would fit into a single address, which agrees with the observed output (e.g., you're getting the mov and pop and some register arguments). But what about tricky things like variable-length instructions, or instructions that don't line up exactly on 4-byte boundaries? You're not doing anything to handle those.
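To make the variable-length point concrete: the usual approach is to advance by however many bytes the decoder says it consumed, rather than by a fixed stride. The libopcodes disassembler function returns that byte count, which is what you would plug in here. A rough sketch under those assumptions; decode_fn, walk_instructions and toy_decode are hypothetical names, and toy_decode is only a stand-in that knows a handful of register-only x86-64 encodings:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder type: returns the number of bytes occupied by the
   instruction at buf[0], or a value <= 0 on failure. */
typedef int (*decode_fn)(const unsigned char *buf, size_t avail);

/* Walk a code buffer and record the start offset of each instruction.
   Returns how many instruction starts were recorded. */
size_t walk_instructions(const unsigned char *code, size_t size,
                         decode_fn decode, size_t *starts, size_t max_starts)
{
    size_t pc = 0, count = 0;

    assert(code != NULL && decode != NULL && starts != NULL);
    while (pc < size && count < max_starts) {
        int len = decode(code + pc, size - pc);
        if (len <= 0 || (size_t)len > size - pc)
            break;               /* undecodable or truncated: stop */
        starts[count++] = pc;
        pc += (size_t)len;       /* advance by the real length, not a fixed stride */
    }
    return count;
}

/* Toy stand-in decoder (NOT a real x86 decoder). The lengths below only
   hold for the register-direct forms used in this example. */
int toy_decode(const unsigned char *buf, size_t avail)
{
    if (avail == 0)
        return -1;
    switch (buf[0]) {
    case 0x31:                       /* xor r/m32, r32 (register form) */
        return 2;
    case 0x50: case 0x54: case 0x5e: /* push %rax, push %rsp, pop %rsi */
        return 1;
    case 0x48: case 0x49:            /* REX.W prefix */
        if (avail < 2)
            return -1;
        if (buf[1] == 0x89)          /* mov r64 -> r/m64 (register form) */
            return 3;
        if (buf[1] == 0x83)          /* group-1 ALU op with imm8 */
            return 4;
        return -1;
    default:
        return -1;
    }
}
```

Fed the bytes 31 ed 49 89 d1 5e (xor %ebp,%ebp; mov %rdx,%r9; pop %rsi), the walk records instruction starts at offsets 0, 2 and 5. In a real tool the toy decoder would be replaced by a wrapper around the actual disassembler call.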
The disassembly generated by libopcodes is a sequence of strings intended for writing to a stream. There is no metadata, so the strings must be examined to determine which are mnemonics and which are operands, and which of these are branch/jump/return instructions and what their targets are.
I'm guessing that Opdis is smarter than your program -- it knows how and what to look for in the stream. Perhaps sometimes it knows that it needs to read two addresses instead of just one before disassembling. Judging from your code and the description of libopcodes, neither of them does this.
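Since there is no metadata, "examining the strings" in practice means something like the following: look at the mnemonic text and decide whether it is a jump, call, return, and so on. This is only a sketch of the idea, assuming AT&T-style mnemonics such as jmpq, callq and retq; insn_kind and classify_mnemonic are made-up names, not part of libopcodes or Opdis:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical categories for control-flow classification. */
enum insn_kind {
    KIND_OTHER,
    KIND_JUMP,          /* unconditional jump */
    KIND_COND_BRANCH,   /* conditional jump (jne, jle, ...) */
    KIND_CALL,
    KIND_RETURN
};

/* Classify an x86 mnemonic string by its prefix. Instruction prefixes
   such as "repz" and far forms such as "ljmp" are not handled here. */
enum insn_kind classify_mnemonic(const char *mnemonic)
{
    assert(mnemonic != NULL);

    if (strncmp(mnemonic, "ret", 3) == 0)
        return KIND_RETURN;
    if (strncmp(mnemonic, "call", 4) == 0)
        return KIND_CALL;
    if (strncmp(mnemonic, "jmp", 3) == 0)
        return KIND_JUMP;
    if (mnemonic[0] == 'j')      /* remaining j* mnemonics are conditional jumps */
        return KIND_COND_BRANCH;
    return KIND_OTHER;
}
```

Branch targets would still have to be parsed out of the operand strings separately, which is part of why this approach gets painful quickly.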
Good luck! Remember to read that manual, and perhaps consider using libopdis instead!