“Finding all the code in a given binary is equivalent to the Halting problem.” Really?

前端 未结 4 777
野性不改
野性不改 2021-01-02 05:07

Was just reading the highly voted question regarding Emulators and the statement

It\'s been proven that finding all the code in a given binary is e

4条回答
  •  一整个雨季
    2021-01-02 05:45

    I agree with Larsman, and I believe the argument can be made precise. First, I have to agree that "finding all the code in a given binary" means, in this context, having a single computable function that determines which bytes within an input binary correspond to instructions that are executed.

    EDIT: If anyone is not clear on why there is a problem here, think about obfuscated machine code. Disassembly of such code is a non-trivial problem. If you begin disassembly in the middle of a multi-byte instruction, you get the wrong disassembly. This breaks not only the instruction in question, but typically a few instructions beyond that. etc. look into it. (for example, google "obfuscation and disassembly").

    Sketch of strategy to make this precise:

    First, define a theoretical computer / machine model in which a program is encoded in multi-bit binary instructions, much like machine code on "real" computers, but made precise (and such that it is turing complete). Assume that the binary encodes the program and all input. This must all be made precise, but assume that the language has a (binary encoded) halt instruction, and that a program halts if and only if it executes a 'halt' instruction.

    First, let us assume that the machine is not able to alter the program code, but has a memory in which to work. (assumed infinite in interesting cases).

    Then any given bit will either be at the start of an instruction that is run during execution of the program, or it will not. (e.g. it may be in the middle of an instruction, or in data or "junk code" -- that is, code that will never run. Note that I have not claimed these are mutually exclusive, as, for example, a jump instruction can have a target that is in the middle of a particular instruction, thereby creating another instruction that, "on first pass" did not look like an executed instruction.).

    Define ins(i, j) to be the function that returns 1 if the binary i has an instruction beginning at bit-position j, that executes in a run of the program encoded by i. Define ins(i,j) to be 0 otherwise. suppose ins(i,j) is computable.

    Let halt_candidate(i) be the following program:

    for j = 1 to length(i)
      if(i(j..j+len('halt')-1) == 'halt')
        if(ins(i,j) == 1)
          return 1
    return 0
    

    Since we disallow self-modifying code above, they only way for a program to halt is for there to be a 'halt' instruction at some position j that gets executed. By design, halt_candidate(i) computes the halting function.

    This provides a very rough sketch of one direction of the proof. i.e. if we assume we can test whether program i has an instruction at location j for all j, then we can code the halting function.

    For the other direction, one must encode an emulator of the formal machine, within the formal machine. Then, create a program plus inputs i and j which emulates program i and when an instruction at bit position j is executed, it returns 1 and halts. When any other instruction is executed it continues to run, and if the simulation ever simulates a 'halt' function in i, the emulator jumps to an infinite loop. Then ins(i,j) is equivalent to halt(emulator(i,j)), and so the halting problem implies the find code problem.

    Of course one must assume a theoretical computer for this to be equivalent to the famously unsolvable halting problem. Otherwise, for a "real" computer, the halting problem is solvable but intractable.

    For a system that allows self-modifying code the argument is more complex, but I expect not that different.

    EDIT: I believe a proof in the self-modifying case would be to implement an emulator of the system-plus-self-modifying-code in the static code plus data system above. Then halting with self-modifying code allowed for program i with data x, is the same as halting in the static code plus data system above with the binary containing the emulator, i, and x as code plus data.

提交回复
热议问题