How are hex sequence translated to assembly without ambiguity?

前端 未结 9 2051
忘了有多久
忘了有多久 2021-01-06 14:59
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40   

A senquence like above can be segmented in various ways,each segment can be translated to correspon

相关标签:
9条回答
  • 2021-01-06 15:37

    First of all you have to distinguish between RISC and CISC architectures.

    In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.

    What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.

    So asking how to separate every instruction is like asking how to separate every word in

    thepenisonthetable

    There's not mathematical proof but you know which letters are correct together and which ones are not meaningful. The previous sentence contains "son" but you know that it is obtained from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs so what's the point?

    So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.

    EDIT:

    To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).

    If the program counter points to

    abbcbcaabdeffabd
    

    you can have that abb is a whole instruction. In that case ab wouldn't be a valid instruction, otherwise the CPU couldn't know where to stop, at the same time abbc can't be an instruction too or it may create problems. Keeping it on you can have for example that ca is the next instruction, c couldn't and cbc neither.

    You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.

    0 讨论(0)
  • 2021-01-06 15:37

    It sounds like the answer to your question is the somewhat flippant "Know your starting point", but maybe you want something a little more verbose. Given your string:

    8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
    

    AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.

    So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.

    Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.

    The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.

    0 讨论(0)
  • 2021-01-06 15:39

    If you open a binary in a hex editor, copy a portion of data and paste in a disassembler, you will very probably not copy a complete instruction. Let's use your example.. in a Windows XP 32bits SP3 English, if I assembly 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40 I'll get:

    Hex dump          Command
    8BEC              mov ebp,esp
    56                push esi
    8BF4              mov esi,esp
    68 00704000       push 407000
    FF15 BC824000     call dword ptr ds:[4082bc]
    

    As you can see it assembled completely different then the answer of the other guys below...

    Now let's pretend that instead of assembling 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
    you added C0 opcode at the beginning C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

    Now check below what a single byte did in our instructions:

    Hex dump             Command
    C08B EC568BF4 68     ror byte ptr ds:[ebx+f48b56ec],68
    0070 40              add byte ptr ds:[eax+40],dh
    00FF                 add bh,bh
    15 BC824000          adc eax,4082bc
    

    As you can see it was completely modified!

    The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was 8B so it knows that it can be followed by another byte. If this byte is EC so the instruction ends here, and it means mov ebp, esp.

    In our second example it start with C0 and it can be followed by another byte, meaning another instruction. Then C08B is the instruction, and EC568BF4 68 is the argument.

    Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".

    0 讨论(0)
  • 2021-01-06 15:39

    Dont confuse linearly trying to disassemble with execution order of code. The binary is decoded in execution order, starting with a known location. Other than intentional hacks for various reasons there is no ambiguity.

    Try writing a disassembler for a variable word length instruction set. At the end of the day it has to be done in execution order, and even there you can only disassemble some of the program as some branches can be based on addresses computed at run time. Modern compiler generated code is much better than older hand assembled code. In an old standup arcade game for example there are conditional branches preceeded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know) and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.

    Old dos programs trying to defeat disassemblers would have self modifying code, one instruction would modify another instruction one or two instructions ahead, if single stepped that modification would happen, but if run at full speed the instruction was already fetched in the pipeline, and the modified/broken one in memory was not used. pretty neat trick.

    Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.

    0 讨论(0)
  • 2021-01-06 15:39

    Maybe you find it interesting to think about the other direction: How would you have to design your code to be easy to segment for others? You could require the most significant bit of the byte starting a sequence to be zero, and those in the middle of a sequence to be one, like UTF-8 does it. Then if you start from a random position – assuming you know where the bytes are – it is easy to find the next sequence. Going one step further, how would you code a pure bit stream such that the start of a sequence is easy to find. How many bits were wasted by such a coding?

    Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.

    How do you find a gene in a sequence of base pairs?

    0 讨论(0)
  • 2021-01-06 15:47

    If I understand your question correctly, you're trying to understand why

    8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

    Could be split e.g. as

    8BEC 568BF4 68007040 00FF 15BC 8240

    Rathern than say,

    8B EC568B F4 68007040 00FF 15BC 8240

    This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.

    For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).

    To get a bit more concrete, lets take the x86 example: If you want to know what each byte corresponds to, have a look here.

    You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).

    You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).

    Overall, the x86 encoding is very complex, thanks for disassemblers.

    0 讨论(0)
提交回复
热议问题