How do I determine the number of x86 machine instructions executed in a C program?

前端 未结 4 1731
感情败类
感情败类 2020-11-27 22:47

I\'m currently working on a homework problem that asks me to find out the number of machine code instructions that are executed when running a short program I wrote in C.

相关标签:
4条回答
  • 2020-11-27 22:58

    One way of doing this could be to manually instrument each instruction with a counting instruction. There are several ways of doing this -

    1. You could modify the Instruction emitter part of any open source compiler (gcc/LLVM) to emit a counting instruction before every instruction. I can add to the answer the exact way of doing this in LLVM if you are interested. But I believe that the second method I am giving here will be easier to implement and will work across most compilers.

    2. You can instrument the instructions post compilation. Most compilers provide the option to generate readable assembly instead of the object files. The flag for gcc/clang is -S. For the following program

    #include <stdio.h>
    int main_real(int argc, char* argv[]) {
        printf("hello world\n");
        return 0;
    }
    

    my compiler produces the following .s file -

        .section    __TEXT,__text,regular,pure_instructions
        .build_version macos, 10, 14
        .globl  _main_real                  ## -- Begin function main
        .p2align    4, 0x90
    _main_real:                                  ## @main_real
        .cfi_startproc
    ## %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        subq    $32, %rsp
        leaq    L_.str(%rip), %rax
        movl    $0, -4(%rbp)
        movl    %edi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rax, %rdi
        movb    $0, %al
        callq   _printf
        xorl    %ecx, %ecx
        movl    %eax, -20(%rbp)         ## 4-byte Spill
        movl    %ecx, %eax
        addq    $32, %rsp
        popq    %rbp
        retq
        .cfi_endproc
                                            ## -- End function
        .section    __TEXT,__cstring,cstring_literals
    L_.str:                                 ## @.str
        .asciz  "hello world\n"
    
    
    .subsections_via_symbols
    

    It is easy to see here that everything that starts with <tab> not followed by a . is an instruction.

    Now we have to simple program that finds all such instructions and instrument them. You can do this easily with perl. But before we actually instrument the code, we have to figure out an appropriate instrumenting instruction. This will depend a lot on the architecture and the target operating system. So I will provide an example for X86_64.

    It is clear why we need to instrument BEFORE the instructions rather than AFTER them, so as to also count the branching instructions.

    Assuming a global variables __r13_save and __instruction_counter initialized to zero, we can insert the instruction -

    movq %r13, __r13_save(%rip)
    movq __instruction_counter(%rip), %r13
    leaq 1(%r13), %r13
    movq %r13, __instruction_counter(%rip)
    movq %r13, __r13_save(%rip)
    

    As you can see we have used the rip relative addressing mode, which should be fine for most programs that a beginner writes (bigger programs might have issues). We have used leaq here instead of incq to avoid clobbering the flags that are used by the program for control flow. (As suggested by @PeterCordes in the comments.)

    This instrumentation also works correctly for single threaded programs since we are using a global counter for instructions and stashing away the %r13 register. For extending the above for multithreaded program, one will have to use thread local storage and instrument the thread creation functions too.

    Also, the variables __r13_save and __instruction_counter are frequently accessed and should always be in the L1 cache, making this instrumentation not that costly.

    Now to instrument the instructions we use perl as -

    cat input.s | perl -pe 's/^(\t[^.])/\tmovq %r13, __r13_save(%rip)\n\tmovq __instruction_counter(%rip), %r13\n\tleaq 1(%r13), %r13\n\tmovq %r13, __instruction_counter(%rip)\n\tmovq %r13, __r13_save(%rip)\n\1/' > output.s
    

    For the above sample program this generates

        .section    __TEXT,__text,regular,pure_instructions
        .build_version macos, 10, 14
        .globl  _main_real              ## -- Begin function main_real
        .p2align    4, 0x90
    _main_real:                             ## @main_real
        .cfi_startproc
    ## %bb.0:
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        subq    $32, %rsp
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        leaq    L_.str(%rip), %rax
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movl    %edi, -4(%rbp)
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movq    %rsi, -16(%rbp)
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movq    %rax, %rdi
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movb    $0, %al
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        callq   _printf
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        xorl    %ecx, %ecx
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movl    %eax, -20(%rbp)         ## 4-byte Spill
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        movl    %ecx, %eax
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        addq    $32, %rsp
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        popq    %rbp
        movq %r13, __r13_save(%rip)
        movq __instruction_counter(%rip), %r13
        leaq 1(%r13), %r13
        movq %r13, __instruction_counter(%rip)
        movq %r13, __r13_save(%rip)
        retq
        .cfi_endproc
                                            ## -- End function
        .section    __TEXT,__cstring,cstring_literals
    L_.str:                                 ## @.str
        .asciz  "hello world\n"
    
    
    .subsections_via_symbols
    

    Now we also need to create this variable somewhere. This can be done by creating a simple c wrapper.c as -

    #include <stdio.h>
    long long int __instruction_counter;
    long long int __r13_save;
    int main_real(int, char* []);
    int main(int argc, char* argv[]) {
        int ret = main_real(argc, argv);
        printf("Total instructions = %lld\n", __instruction_counter);
        return ret;
    }
    

    You might see the function main_real. So in your actual program you have to create a main_real instead of main.

    Finally link everything up as -

    clang output.s wrapper.c -o a.out
    

    and execute your program. Your code should run normally and print the instruction count before it exits.

    You might have to take care of name mangling of the __instruction_counter variable. For some ABIs the compiler adds an extra _ at the beginning. In that case you will have to add an extra _ to the perl command. You can check the exact name for the variable by also generating the assembly for the wrapper.

    On running the above example I get -

    hello world
    Total instructions = 15
    

    Which matches the exact number of instruction our function has. You might have noticed that this counts only the number of instructions in the code you have written and compiled. Not in the printf function for instance. That is usually a difficult problem to tackle with static instrumentation.

    One caveat here is that your program has to exit "normally" i.e. by returning from main. If it calls exit or abort, you will not be able to see the instruction count. You can also provide an instrumented version of exit and abort to solve that problem.

    With a compiler based approach this can be made more efficient by adding a single addq instruction for each basic block with the parameter being the number of instruction that BB has, since once the control flow enters a basic block, it is bound to go through it.

    0 讨论(0)
  • 2020-11-27 23:04

    You can use Godbolt's Compiler Explorer to compile your program and display the assembly code for various compilers and options.

    Then count the number of instructions for every fragment, ie: sequence of statements upto and including the first test.

    Then instrument you code: add a global variable instruction_count, initialized to the number of instructions in the main function epilog and increment this variable at the beginning of each fragment by the number of instructions you counted in the previous step. and print this number just before returning from the main function.

    You will get the number of instructions that would be executed by the uninstrumented program for whatever input is provided to the program, for a given combination of architecture, compiler and options, but not including instructions executed in library functions nor during the startup and exit phases.

    0 讨论(0)
  • 2020-11-27 23:21

    As I mentioned in my top comments, one way to do this is to write a program that feeds commands to gdb.

    Specifically, the si command (step ISA instruction).

    I couldn't get this to work with pipes, but I was able to get it to work by putting gdb under a pseudo-tty.

    Edit: After thinking about it, I came up with a version that uses ptrace directly on the target program instead of sending commands to gdb. It is much faster [100x faster] and [probably] more reliable


    So, here's the gdb based control program. Note that this must be linked with -lutil.

    // gdbctl -- gdb control via pseudo tty
    
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <time.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <poll.h>
    #include <pty.h>
    #include <utmp.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    
    int opt_d;                              // 1=show debug output
    int opt_e;                              // 1=echo gdb output
    int opt_f;                              // 1=set line buffered output
    int opt_x;                              // si repetition factor
    
    int zpxlvl;                             // current trace level
    
    int ptypar;                             // parent PTY fd
    int ptycld;                             // child PTY fd
    char name[100];                         // child PTY device name
    
    unsigned long long sicount;             // single step count
    
    const char *gdb = "(gdb) ";             // gdb's prompt string
    const char *waitstr;                    // currently active "wait for" string
    char *waitstop[8] = { NULL };           // string that shows run is done
    
    int stopflg;                            // 1=waitstop seen
    char sicmd[100];
    
    char waitbuf[10000];                    // large buffer to scan for strings
    char *waitdst = waitbuf;                // current end position
    
    pid_t pidgdb;                           // gdb's pid
    pid_t pidfin;                           // stop pid
    int status;                             // gdb's final status
    double tvelap;                          // start time
    
    #ifndef _USE_ZPRT_
    #define _USE_ZPRT_      1
    #endif
    
    static inline int
    zprtok(int lvl)
    {
    
        return (_USE_ZPRT_ && (opt_d >= lvl));
    }
    
    #define dbg(_lvl,_fmt...) \
        do { \
            if (zprtok(_lvl)) \
                printf(_fmt); \
        } while (0)
    
    // tvgetf -- get high precision time
    double
    tvgetf(void)
    {
        struct timespec ts;
        double sec;
    
        clock_gettime(CLOCK_REALTIME,&ts);
        sec = ts.tv_nsec;
        sec /= 1e9;
        sec += ts.tv_sec;
    
        return sec;
    }
    
    // xstrcat -- concatenate a string
    char *
    xstrcat(char *dst,const char *src)
    {
        int chr;
    
        for (chr = *src++;  chr != 0;  chr = *src++)
            *dst++ = chr;
    
        *dst = 0;
    
        return dst;
    }
    
    // gdbexit -- check for gdb termination
    void
    gdbexit(void)
    {
    
        // NOTE: this should _not_ happen
        do {
            if (pidgdb == 0)
                break;
    
            pidfin = waitpid(pidgdb,&status,WNOHANG);
            if (pidfin == 0)
                break;
    
            pidgdb = 0;
            printf("gdbexit: WAITPID status=%8.8X\n",status);
            exit(8);
        } while (0);
    }
    
    // gdbwaitpoll -- wait for prompt string
    void
    gdbwaitpoll(const char *buf)
    {
        char *cp;
        char **wstr;
    
        do {
            gdbexit();
    
            if (waitstr == NULL)
                break;
    
            // concatenate to big buffer
            dbg(2,"BUF '%s'\n",buf);
            waitdst = xstrcat(waitdst,buf);
    
            // check for final termination string (e.g. "exited with")
            for (wstr = waitstop;  *wstr != NULL;  ++wstr) {
                cp = *wstr;
                dbg(2,"TRYSTOP '%s'\n",cp);
                cp = strstr(waitbuf,cp);
                if (cp != NULL) {
                    stopflg = 1;
                    waitstop[0] = NULL;
                }
            }
    
            // check for the prompt (e.g. "(gdb) ")
            cp = strstr(waitbuf,waitstr);
            if (cp == NULL)
                break;
    
            dbg(1,"HIT on '%s'\n",waitstr);
    
            // got it reset things
            waitbuf[0] = 0;
            waitdst = waitbuf;
            waitstr = NULL;
        } while (0);
    }
    
    // gdbrecv -- process input from gdb
    void
    gdbrecv(void)
    {
        struct pollfd fds[1];
        struct pollfd *fd = &fds[0];
        int xlen;
        char buf[1000];
    
        fd->fd = ptypar;
        fd->events = POLLIN;
    
        while (1) {
            gdbexit();
    
    #if 1
            int nfd = poll(fds,1,1);
            if (nfd <= 0) {
                if (waitstr != NULL)
                    continue;
                break;
            }
    #endif
    
            // get a chunk of data
            xlen = read(ptypar,buf,sizeof(buf) - 1);
            dbg(1,"gdbrecv: READ xlen=%d\n",xlen);
    
            if (xlen < 0) {
                printf("ERR: %s\n",strerror(errno));
                break;
            }
    
            // wait until we've drained every bit of data
            if (xlen == 0) {
                if (waitstr != NULL)
                    continue;
                break;
            }
    
            // add EOS char
            buf[xlen] = 0;
    
            dbg(1,"ECHO: ");
            if (opt_e)
                fwrite(buf,1,xlen,stdout);
    
            // wait for our prompt
            gdbwaitpoll(buf);
        }
    }
    
    // gdbwaitfor -- set up prompt string to wait for
    void
    gdbwaitfor(const char *wstr,int loopflg)
    {
    
        waitstr = wstr;
        if (waitstr != NULL)
            dbg(1,"WAITFOR: '%s'\n",waitstr);
    
        while ((waitstr != NULL) && loopflg && (pidgdb != 0))
            gdbrecv();
    }
    
    // gdbcmd -- send command to gdb
    void
    gdbcmd(const char *str,const char *wstr)
    {
        int rlen = strlen(str);
        int xlen = 0;
    
    #if 0
        printf("CMD/%d: %s",rlen,str);
    #endif
    
        gdbwaitfor(wstr,0);
    
        for (;  rlen > 0;  rlen -= xlen, str += xlen) {
            gdbexit();
    
            xlen = write(ptypar,str,rlen);
            if (xlen <= 0)
                break;
    
            dbg(1,"RET: rlen=%d xlen=%d\n",rlen,xlen);
            gdbrecv();
        }
    
        dbg(1,"END/%d\n",xlen);
    }
    
    // gdbctl -- control gdb
    void
    gdbctl(int argc,char **argv)
    {
    
        // this is the optimal number for speed
        if (opt_x < 0)
            opt_x = 100;
    
        if (opt_x <= 1) {
            opt_x = 1;
            sprintf(sicmd,"si\n");
        }
        else
            sprintf(sicmd,"si %d\n",opt_x);
    
        // create pseudo TTY
        openpty(&ptypar,&ptycld,name,NULL,NULL);
    
        pidgdb = fork();
    
        // launch gdb
        if (pidgdb == 0) {
            //sleep(1);
    
            login_tty(ptycld);
            close(ptypar);
    
            char *gargs[8];
            char **gdst = gargs;
    
            *gdst++ = "gdb";
            *gdst++ = "-n";
            *gdst++ = "-q";
            *gdst++ = *argv;
            *gdst = NULL;
    
            execvp(gargs[0],gargs);
            exit(9);
        }
    
        // make input from gdb non-blocking
    #if 1
        int flags = fcntl(ptypar,F_GETFL,0);
        flags |= O_NONBLOCK;
        fcntl(ptypar,F_SETFL,flags);
    #endif
    
        // wait
        char **wstr = waitstop;
        *wstr++ = "exited with code";
        *wstr++ = "Program received signal";
        *wstr++ = "Program terminated with signal";
        *wstr = NULL;
    
        printf("TTY: %s\n",name);
        printf("SI: %d\n",opt_x);
        printf("GDB: %d\n",pidgdb);
    
    #if 1
        sleep(2);
    #endif
    
        gdbwaitfor(gdb,1);
    
        // prevent kill or quit commands from hanging
        gdbcmd("set confirm off\n",gdb);
    
        // set breakpoint at earliest point
    #if 1
        gdbcmd("b _start\n",gdb);
    #else
        gdbcmd("b main\n",gdb);
    #endif
    
        // skip over target program name
        --argc;
        ++argv;
    
        // add extra arguments
        do {
            if (argc <= 0)
                break;
    
            char xargs[1000];
            char *xdst = xargs;
            xdst += sprintf(xdst,"set args");
    
            for (int avidx = 0;  avidx < argc;  ++avidx, ++argv) {
                printf("XARGS: '%s'\n",*argv);
                xdst += sprintf(xdst," %s",*argv);
            }
    
            xdst += sprintf(xdst,"\n");
    
            gdbcmd(xargs,gdb);
        } while (0);
    
        // run the program -- it will stop at the breakpoint we set
        gdbcmd("run\n",gdb);
    
        // disable the breakpoint for speed
        gdbcmd("disable\n",gdb);
    
        tvelap = tvgetf();
    
        while (1) {
            // single step an ISA instruction
            gdbcmd(sicmd,gdb);
    
            // check for gdb aborting
            if (pidgdb == 0)
                break;
    
            // check for target program exiting
            if (stopflg)
                break;
    
            // advance count of ISA instructions
            sicount += opt_x;
        }
    
        // get elapsed time
        tvelap = tvgetf() - tvelap;
    
        // tell gdb to quit
        gdbcmd("quit\n",NULL);
    
        // wait for gdb to completely terminate
        if (pidgdb != 0) {
            pidfin = waitpid(pidgdb,&status,0);
            pidgdb = 0;
        }
    
        // close PTY units
        close(ptypar);
        close(ptycld);
    }
    
    // main -- main program
    int
    main(int argc,char **argv)
    {
        char *cp;
    
        --argc;
        ++argv;
    
        for (;  argc > 0;  --argc, ++argv) {
            cp = *argv;
            if (*cp != '-')
                break;
    
            switch (cp[1]) {
            case 'd':
                cp += 2;
                opt_d = (*cp != 0) ? atoi(cp) : 1;
                break;
    
            case 'e':
                cp += 2;
                opt_e = (*cp != 0) ? atoi(cp) : 1;
                break;
    
            case 'f':
                cp += 2;
                opt_f = (*cp != 0) ? atoi(cp) : 1;
                break;
    
            case 'x':
                cp += 2;
                opt_x = (*cp != 0) ? atoi(cp) : -1;
                break;
            }
        }
    
        if (argc == 0) {
            printf("specify target program\n");
            exit(1);
        }
    
        // set output line buffering
        switch (opt_f) {
        case 0:
            break;
    
        case 1:
            setlinebuf(stdout);
            break;
    
        default:
            setbuf(stdout,NULL);
            break;
        }
    
        gdbctl(argc,argv);
    
        // print statistics
        printf("%llu instructions -- ELAPSED: %.9f -- %.3f insts / sec\n",
            sicount,tvelap,(double) sicount / tvelap);
    
        return 0;
    }
    

    Here's a sample test program:

    // tgt -- sample slave/test program
    
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    
    int opt_S;
    
    int glob;
    
    void
    dumb(int x)
    {
    
        glob += x;
    }
    
    int
    spin(int lim)
    {
        int x;
    
        for (x = 0;  x < lim;  ++x)
            dumb(x);
    
        return x;
    }
    
    int
    main(int argc,char **argv)
    {
        char *cp;
        int lim;
        int *ptr;
        int code;
    
        --argc;
        ++argv;
    
        for (;  argc > 0;  --argc, ++argv) {
            cp = *argv;
            if (*cp != '-')
                break;
    
            switch (cp[1]) {
            case 'S':
                opt_S = cp[2];
                break;
            }
        }
    
        switch (opt_S) {
        case 'f':  // cause segfault
            ptr = NULL;
            *ptr = 23;
            code = 91;
            break;
    
        case 'a':  // abort
            abort();
            code = 92;
            break;
    
        case 't':  // terminate us
            signal(SIGTERM,SIG_DFL);
    #if 0
            kill(getpid(),SIGTERM);
    #else
            raise(SIGTERM);
    #endif
            code = 93;
            break;
    
        default:
            code = 0;
            break;
        }
    
        if (argc > 0)
            lim = atoi(argv[0]);
        else
            lim = 10000;
    
        lim = spin(lim);
        lim &= 0x7F;
        if (code == 0)
            code = lim;
    
        return code;
    }
    

    Here's a version that uses ptrace that is much faster than the version that uses gdb:

    // ptxctl -- control via ptrace
    
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <time.h>
    //#include <fcntl.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    
    int opt_d;                              // 1=show debug output
    int opt_e;                              // 1=echo progress
    int opt_f;                              // 1=set line buffered output
    
    unsigned long long sicount;             // single step count
    
    int stopflg;                            // 1=stop seen
    
    pid_t pidtgt;                           // gdb's pid
    pid_t pidfin;                           // stop pid
    
    int status;                             // target's final status
    char statbuf[1000];                     // status buffer
    int coredump;                           // 1=core dumped
    
    int zpxlvl;                             // current trace level
    
    int regsidx;                            // regs index
    struct user_regs_struct regs[2];        // current regs
    
    #define REGSALL(_cmd) \
        _cmd(r15) \
        _cmd(r14) \
        _cmd(r13) \
        _cmd(r12) \
        _cmd(rbp) \
        _cmd(rbx) \
        _cmd(r11) \
        _cmd(r10) \
        _cmd(r9) \
        _cmd(r8) \
        _cmd(rax) \
        _cmd(rcx) \
        _cmd(rdx) \
        _cmd(rsi) \
        _cmd(rdi) \
        _cmd(orig_rax) \
        /*_cmd(rip)*/ \
        _cmd(cs) \
        _cmd(eflags) \
        _cmd(rsp) \
        _cmd(ss) \
        _cmd(fs_base) \
        _cmd(gs_base) \
        _cmd(ds) \
        _cmd(es) \
        _cmd(fs) \
        _cmd(gs)
    
    #define REGSDIF(_reg) \
        if (cur->_reg != prev->_reg) \
            printf("  %16.16llX " #_reg "\n",cur->_reg);
    
    double tvelap;                          // start time
    
    #ifndef _USE_ZPRT_
    #define _USE_ZPRT_      1
    #endif
    
    static inline int
    zprtok(int lvl)
    {
    
        return (_USE_ZPRT_ && (opt_d >= lvl));
    }
    
    #define dbg(_lvl,_fmt...) \
        do { \
            if (zprtok(_lvl)) \
                printf(_fmt); \
        } while (0)
    
    // tvgetf -- get high precision time
    double
    tvgetf(void)
    {
        struct timespec ts;
        double sec;
    
        clock_gettime(CLOCK_REALTIME,&ts);
        sec = ts.tv_nsec;
        sec /= 1e9;
        sec += ts.tv_sec;
    
        return sec;
    }
    
    // ptxstatus -- decode status
    char *
    ptxstatus(int status)
    {
        int zflg;
        int signo;
        char *bp;
    
        bp = statbuf;
        *bp = 0;
    
        // NOTE: do _not_ use zprtok here -- we need to force this on final
        zflg = (opt_d >= zpxlvl);
    
        do {
            if (zflg)
                bp += sprintf(bp,"%8.8X",status);
    
            if (WIFSTOPPED(status)) {
                signo = WSTOPSIG(status);
                if (zflg)
                    bp += sprintf(bp," WIFSTOPPED signo=%d",signo);
    
                switch (signo) {
                case SIGTRAP:
                    break;
                default:
                    stopflg = 1;
                    break;
                }
            }
    
            if (WIFEXITED(status)) {
                if (zflg)
                    bp += sprintf(bp," WIFEXITED code=%d",WEXITSTATUS(status));
                stopflg = 1;
            }
    
            if (WIFSIGNALED(status)) {
                signo = WTERMSIG(status);
                if (zflg)
                    bp += sprintf(bp," WIFSIGNALED signo=%d",signo);
    
                if (WCOREDUMP(status)) {
                    coredump = 1;
                    stopflg = 1;
                    if (zflg)
                        bp += sprintf(bp," -- core dumped");
                }
            }
        } while (0);
    
        return statbuf;
    }
    
    // ptxcmd -- issue ptrace command
    long
    ptxcmd(enum __ptrace_request cmd,void *addr,void *data)
    {
        long ret;
    
        dbg(zpxlvl,"ptxcmd: ENTER cmd=%d addr=%p data=%p\n",cmd,addr,data);
        ret = ptrace(cmd,pidtgt,addr,data);
        dbg(zpxlvl,"ptxcmd: EXIT ret=%ld\n",ret);
    
        return ret;
    }
    
    // ptxwait -- wait for target to be stopped
    void
    ptxwait(const char *reason)
    {
    
        dbg(zpxlvl,"ptxwait: %s pidtgt=%d\n",reason,pidtgt);
        pidfin = waitpid(pidtgt,&status,0);
    
        // NOTE: we need this to decide on stop status
        ptxstatus(status);
    
        dbg(zpxlvl,"ptxwait: %s status=(%s) pidfin=%d\n",
            reason,statbuf,pidfin);
    }
    
    // ptxwhere -- show where we are
    void
    ptxwhere(int initflg)
    {
        struct user_regs_struct *cur;
        struct user_regs_struct *prev;
    
        do {
            prev = &regs[regsidx];
    
            if (initflg) {
                ptxcmd(PTRACE_GETREGS,NULL,prev);
                break;
            }
    
            regsidx = ! regsidx;
            cur = &regs[regsidx];
    
            ptxcmd(PTRACE_GETREGS,NULL,cur);
            printf("RIP: %16.16llX (%llu)\n",cur->rip,sicount);
    
            if (opt_e < 2)
                break;
    
            REGSALL(REGSDIF);
        } while (0);
    }
    
    // ptxctl -- control ptrace
    void
    ptxctl(int argc,char **argv)
    {
    
        pidtgt = fork();
    
        // launch target program
        if (pidtgt == 0) {
            pidtgt = getpid();
            ptxcmd(PTRACE_TRACEME,NULL,NULL);
            execvp(argv[0],argv);
            exit(9);
        }
    
    #if 0
        sleep(1);
    #endif
    
        zpxlvl = 1;
    
    #if 0
        ptxwait("SETUP");
    #endif
    
        // attach to tracee
        // NOTE: we do _not_ need to do this because child has done TRACEME
    #if 0
        dbg(zpxlvl,"ptxctl: PREATTACH\n");
        ptxcmd(PTRACE_ATTACH,NULL,NULL);
        dbg(zpxlvl,"ptxctl: POSTATTACH\n");
    #endif
    
        // wait for initial stop
    #if 1
        ptxwait("INIT");
    #endif
    
        if (opt_e)
            ptxwhere(1);
    
        dbg(zpxlvl,"ptxctl: START\n");
    
        tvelap = tvgetf();
    
        zpxlvl = 2;
    
        while (1) {
            dbg(zpxlvl,"ptxctl: SINGLESTEP\n");
            ptxcmd(PTRACE_SINGLESTEP,NULL,NULL);
            ptxwait("WAIT");
    
            sicount += 1;
    
            // show where we are
            if (opt_e)
                ptxwhere(0);
    
            dbg(zpxlvl,"ptxctl: STEPCOUNT sicount=%lld\n",sicount);
    
            // stop when target terminates
            if (stopflg)
                break;
        }
    
        zpxlvl = 0;
        ptxstatus(status);
        printf("ptxctl: STATUS (%s) pidfin=%d\n",statbuf,pidfin);
    
        // get elapsed time
        tvelap = tvgetf() - tvelap;
    }
    
    // main -- main program
    int
    main(int argc,char **argv)
    {
        char *cp;
    
        --argc;
        ++argv;
    
        for (;  argc > 0;  --argc, ++argv) {
            cp = *argv;
            if (*cp != '-')
                break;
    
            switch (cp[1]) {
            case 'd':
                cp += 2;
                opt_d = (*cp != 0) ? atoi(cp) : 1;
                break;
    
            case 'e':
                cp += 2;
                opt_e = (*cp != 0) ? atoi(cp) : 1;
                break;
    
            case 'f':
                cp += 2;
                opt_f = (*cp != 0) ? atoi(cp) : 1;
                break;
            }
        }
    
        if (argc == 0) {
            printf("specify target program\n");
            exit(1);
        }
    
        // set output line buffering
        switch (opt_f) {
        case 0:
            break;
    
        case 1:
            setlinebuf(stdout);
            break;
    
        default:
            setbuf(stdout,NULL);
            break;
        }
    
        ptxctl(argc,argv);
    
        // print statistics
        printf("%llu instructions -- ELAPSED: %.9f -- %.3f insts / sec\n",
            sicount,tvelap,(double) sicount / tvelap);
    
        return 0;
    }
    
    0 讨论(0)
  • 2020-11-27 23:24

    Terminology: what you're asking for is dynamic instruction count. e.g. counting an instruction inside a loop every time it's executed. This is usually roughly correlated with performance, but instructions-per-cycle can vary wildly.

    • How many CPU cycles are needed for each assembly instruction?
    • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

    Something people also look at is static instruction count (or more usually just code-size, because that's what really matters for instruction-cache footprint, and disk-load times). For variable-length instruction sets like x86, those are correlated but not the same thing. On a RISC with fixed-length instructions, like MIPS or AArch64, it's closer but you still have padding for alignment of the start of functions, for example. That's a totally separate metric. gcc -Os optimizes for code-size while trying not to sacrifice to much speed.


    If you're on Linux, use gcc -O2 foo.c to compile your code. -O2 doesn't enable auto-vectorization for gcc. (It does for clang). It's probably a good baseline level of optimization that will get rid of stuff in your C code that doesn't actually need to happen, to avoid silly differences between using more or fewer tmp variables to break up a big expression. Maybe use -Og if you want minimal optimization, or -O0 if you want really dumb braindead code that compiles each statement separately and never keeps anything in registers between statements. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?).

    Yes, it matters a huge amount how you compile. gcc -O3 -march=native -ffast-math might use a lot fewer instructions, if it auto-vectorizes a loop.

    To stop your code from optimizing away, take an input from a command-line arg, or read it from a volatile variable. Like volatile int size_volatile = 1234; int size = size_volatile;. And return or print a result, because if the program has no side-effects then the most efficient implementation is to just exit immediately.


    Then run perf stat ./a.out. That will use hardware performance counters to give you total instructions executed on behalf of your process, including inside the kernel. (Along with other counters, like CPU core clock cycles, and some software counters like page-faults and time in microseconds.)

    To count only user-space instructions, use perf stat -e instructions:u ./a.out. That will still be a very big number even for a simple "hello world" program, like 180k, because that includes dynamic-linker startup and all the code that runs inside library functions. And CRT startup code that calls your main, and that makes an exit system call with main's return value, if you return instead of calling exit(3).

    You might statically link your C program to reduce that startup overhead, by compiling with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie

    perf counting instructions:u seems to be pretty accurate on my Skylake CPU. A statically-linked x86-64 binary that contains only 2 instructions, mov eax, 231 / syscall, is counted as 3 instructions. Probably there's one extra instruction being counted in the transition between kernel and user mode, but that's pretty minor.

    $ perf stat -e instructions:u ./exit    # hand-written in asm to check for perf overhead
    
     Performance counter stats for './exit':
    
                     3      instructions:u                                              
    
           0.000651529 seconds time elapsed
    

    A statically-linked binary that calls puts twice counts 33,202 instructions:u, compiled with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie hello.c. Seems reasonable for glibc init functions, including stdio, and CRT startup stuff before calling main. (main itself only has 8 instructions, which I checked with objdump -drwC -Mintel a.out | less).


    Other resources:

    • Number of executed Instructions different for Hello World program Nasm Assembly and C

      @MichaelPetch's answer shows how to use an alternate libc (MUSL) that doesn't need startup code to run for its printf to work. So you can compile a C program and set its main as the ELF entry point (and call _exit() instead of returning).

    • How can I profile C++ code running on Linux? There are tons of profiling tools for finding hotspots, and expensive functions (including the time spent in functions they call, i.e. stack backtrace profiling). Mostly this isn't about counting instructions, though.


    Binary instrumentation tools:

    These are the heavy duty tools for counting instructions, including counting only specific kinds of instructions.

    • Intel Pin - A Dynamic Binary Instrumentation Tool
    • Intel® Software Development Emulator (SDE) This is based on PIN, and is handy for things like testing AVX512 code on a dev machine that doesn't support AVX512. (It dynamically recompiles so most instructions run natively, but unsupported instructions call an emulation routine.)

      For example, sde64 -mix -- ./my_program will print an instruction-mix for your program, with total counts for each different instruction, and breakdowns by categories. See libsvm compiled with AVX vs no AVX for an example of the kind of output.

      It also gives you a table of total dynamic instruction counts per-function, as well as per-thread and global. SDE mix output doesn't work well on PIE executable, though: it thinks the dynamic linker is the executable (because it is), so compile with gcc -O2 -no-pie -fno-pie prog.c -o prog. It still doesn't see the puts calls or main itself in the profile output for a hello world test program, though, and I don't know why not.

    • Calculating “FLOP” using Intel® Software Development Emulator (Intel® SDE) An example of using SDE to count certain kinds of instructions, like vfmadd231pd.

      Intel CPUs have HW perf counters for events like fp_arith_inst_retired.256b_packed_double, so you can use those to count FLOPs instead. They actually count FMA as 2 events. So if you have an Intel CPU that can run your code natively, you can do that instead with perf stat -e -e fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.scalar_double. (And/or the events for single-precision.)

      But there aren't events for most other specific kinds of instructions, only FP math.

    This is all Intel stuff; IDK what AMD has, or any stuff for ISAs other than x86. These are just the tools I've heard of; I'm sure there are lots of things I'm leaving out.

    0 讨论(0)
提交回复
热议问题