My guess is that the __no_operation() intrinsic (ARM) instruction should take 1/(168 MHz) to execute, provided that each NOP executes in o
The number of clock cycles per instruction DOES matter.
On an AVR, it's (usually) 1 instruction/clock, so a 12 MHz AVR runs at about 12 MIPS.
On a PIC, it's usually 1 instruction/4 clocks, so a 12 MHz PIC runs at about 3 MIPS.
On an 8051 (original), it's 1 instruction/12 clocks, so a 12 MHz 8051 runs at about 1 MIPS.
To know how much you can get done, instructions per clock are relevant. This is why an AMD processor could get more done per MHz than an Intel processor.
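As a rough illustration of the arithmetic above, here is a minimal C sketch. The function name approx_mips is made up for this example, and it assumes every instruction takes the same fixed number of clocks, which real parts only approximate.

```c
#include <assert.h>

/* Approximate peak instruction rate in MIPS for a core that needs a fixed
 * number of clocks per instruction. Hypothetical helper: real cores have
 * instructions that take more clocks (branches, multi-byte opcodes, ...). */
double approx_mips(double clock_mhz, double clocks_per_instruction)
{
    assert(clocks_per_instruction > 0.0);
    return clock_mhz / clocks_per_instruction;
}
```

Calling approx_mips(12.0, 4.0) reproduces the PIC figure from the list, and the other two entries work the same way with 1 and 12 clocks per instruction.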
If you carefully configure all your clocks in the Reset and Clock Control (RCC) block and you know all the clocks, you can calculate the execution time exactly for most instructions and have at least a worst-case estimate for all of them. For example, I'm using an STM32F439ZI processor, which is a Cortex-M4 part compatible with the STM32F407. If you look at the reference manual, the clock tree shows you the PLL and all the bus prescalers. In my case I have an 8 MHz external quartz crystal with the PLL configured to provide an 84 MHz system clock, SYSCLK. That means that one processor cycle is 1.0/84e6 s ~ 12 ns.
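To make the clock arithmetic concrete, here is a small sketch of the STM32F4 main PLL relation (VCO = input × PLLN / PLLM, SYSCLK = VCO / PLLP) and the resulting cycle time. The divider values 8, 336 and 4 used in the note below are an assumption chosen so that 8 MHz gives 84 MHz; the post does not say which values were actually programmed, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* SYSCLK in Hz when the main PLL is the system clock source:
 * f_VCO = f_in * PLLN / PLLM, SYSCLK = f_VCO / PLLP. */
uint32_t pll_sysclk_hz(uint32_t f_in_hz, uint32_t pllm, uint32_t plln, uint32_t pllp)
{
    assert(pllm != 0 && pllp != 0);
    uint64_t vco_hz = (uint64_t)f_in_hz * plln / pllm;
    return (uint32_t)(vco_hz / pllp);
}

/* Duration of one core clock cycle in nanoseconds. */
double cycle_time_ns(uint32_t sysclk_hz)
{
    assert(sysclk_hz != 0);
    return 1e9 / (double)sysclk_hz;
}
```

With the assumed dividers, pll_sysclk_hz(8000000u, 8, 336, 4) gives 84000000, and cycle_time_ns of that is a little under 12 ns, matching the figure quoted above.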
To find out how many SYSCLK cycles an instruction takes, use the ARM® Cortex®‑M4 Processor Technical Reference Manual. For example, the MOV instruction takes one cycle in most cases. The ADD instruction also takes one cycle in most cases, which means that after about 12 ns the result of the addition is stored in the register and ready for use by another operation.
You can use that information to schedule your processor resources in many cases, periodic interrupts for instance. Electrical engineers and low-level embedded software developers do talk about this, and do it, when it comes to strict real-time and safety-critical systems. During design, engineers normally work with the worst-case execution time and ignore the pipeline, to get a quick and rough insight into the processor load. During implementation you use tools for precise timing analysis and refine the software.
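A rough load estimate of the kind described above could look like the following sketch. The struct, the function name and every cycle count in the note below are hypothetical. It simply adds up worst-case cycles per second for each periodic interrupt and divides by the core clock, ignoring the pipeline, preemption overhead and bus stalls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One periodic interrupt or task (illustrative type, not from any library). */
typedef struct {
    uint32_t wcet_cycles; /* worst-case cycles per activation */
    double   rate_hz;     /* activations per second */
} periodic_task;

/* Fraction of the core's cycles consumed in the worst case (1.0 = fully loaded). */
double cpu_load(const periodic_task *tasks, size_t count, double sysclk_hz)
{
    assert(sysclk_hz > 0.0);
    double busy_cycles_per_second = 0.0;
    for (size_t i = 0; i < count; ++i) {
        busy_cycles_per_second += (double)tasks[i].wcet_cycles * tasks[i].rate_hz;
    }
    return busy_cycles_per_second / sysclk_hz;
}
```

For instance, an 840-cycle handler at 10 kHz plus a 4200-cycle handler at 1 kHz on an 84 MHz core comes out at a load of about 0.15, i.e. 15 %.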
Over the course of design and implementation, the non-deterministic effects are reduced until they are negligible.
ALL instructions require more than one clock cycle to execute: fetch, decode, execute. If you are running on an STM32 you are likely taking several clocks per fetch just due to the slowness of the PROM (flash); if running from RAM, who knows if it is 168 MHz or slower. The ARM buses generally take a number of clock cycles to do anything.
Nobody talks about instruction cycles anymore because they are not deterministic. The answer is always "it depends".
It may take X hours to build a single car, but if you start building a car then 30 seconds later start building another and every 30 seconds start another then after X hours you will have a new car every 30 seconds. Does that mean it takes 30 seconds to make a car? Of course not. But it does mean that once up and running you can average a new car every 30 seconds on that production line.
That is exactly how processors work: each instruction takes a number of clocks to run, but you pipeline them so that many are in the pipe at once, and the average is such that the core, if fed the right instructions one per clock, can complete those instructions one per clock. With branching and slow memory/ROM, you can't even expect to get that.
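The production-line picture can be put into numbers with an idealized model: in a pipeline with a given number of stages and no stalls, the first instruction needs one cycle per stage and every following one completes a cycle later. The sketch below assumes exactly that (no branches, no wait states), and the function names are made up. The Cortex-M4 has a 3-stage pipeline, so 3 is the value used in the note.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to complete n instructions in an ideal, stall-free pipeline. */
uint64_t pipelined_cycles(uint64_t n_instructions, uint32_t stages)
{
    assert(stages > 0);
    if (n_instructions == 0) {
        return 0;
    }
    return (uint64_t)stages + (n_instructions - 1);
}

/* Average cycles per instruction over the whole run. */
double average_cpi(uint64_t n_instructions, uint32_t stages)
{
    assert(n_instructions > 0);
    return (double)pipelined_cycles(n_instructions, stages) / (double)n_instructions;
}
```

A single instruction through 3 stages averages 3 cycles, while 1000 back-to-back instructions average about 1.002 cycles each. This is the "30 seconds per car" effect, and it is also why timing one isolated instruction tells you little.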
If you want to do an experiment on your processor, then make a loop with a few hundred nops:
beg = read timer
load r0 = 100000
top:
nop
nop
nop
nop
nop
nop
...
nop
nop
nop
r0 = r0 - 1
bne top
end = read timer
If it takes fractions of a second to complete that loop then either make the number of nops larger or have it run an order of magnitude more loops. Actually you want to hit a significant number of timer ticks, not necessarily seconds or minutes on a wall clock but something in terms of a good sized number of timer ticks.
Then do the math and compute the average.
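The math could be sketched as below. It assumes a free-running, up-counting 32-bit timer that wraps at most once during the test, and that you know both the timer frequency and the core frequency. The function and parameter names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Average core cycles per instruction for the timed loop.
 * instr_per_loop counts the nops plus the subtract and the branch. */
double avg_cycles_per_instruction(uint32_t beg_ticks, uint32_t end_ticks,
                                  double timer_hz, double core_hz,
                                  uint32_t loops, uint32_t instr_per_loop)
{
    assert(timer_hz > 0.0 && loops > 0 && instr_per_loop > 0);
    /* Unsigned subtraction stays correct across one timer wrap. */
    uint32_t elapsed_ticks = end_ticks - beg_ticks;
    double core_cycles = (double)elapsed_ticks * (core_hz / timer_hz);
    return core_cycles / ((double)loops * (double)instr_per_loop);
}
```

With 300 nops per pass (302 instructions including the subtract and branch) and 100000 passes, a reading of 60400000 timer ticks from a timer running at the core clock would mean an average of 2 cycles per instruction. Those numbers are invented to show the calculation; they are not a measurement.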
Repeat the experiment with the program sitting in RAM instead of ROM.
Slow the processor clock down to the fastest speed that does not require flash wait states, and repeat running from flash.
This being a Cortex-M4, turn the I cache on, repeat using flash, then repeat using RAM (at 168 MHz).
If you didn't get a range of different results from all of these experiments using the same test loop, you are probably doing something wrong.
Because pipelining affects perceived execution time, a single instruction will measure differently than a sequence of the same instruction.
You could measure the timing of the scenario you care about using the built-in cycle-counting register, as discussed in your other post here.
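On a Cortex-M4 that register is the DWT cycle counter (DWT->CYCCNT in CMSIS), which is 32 bits wide and wraps around. A minimal sketch of the bookkeeping around two readings follows. The helper names are made up, the register reads themselves are left out so the code stays target-independent, and the overhead figure is something you would have to measure yourself, for example by reading the counter twice back to back.

```c
#include <stdint.h>

/* Cycles between two readings of a free-running 32-bit cycle counter.
 * Unsigned subtraction is modulo 2^32, so one wrap between the two
 * readings is handled automatically. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Elapsed cycles with the measurement overhead removed, clamped at 0. */
uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead_cycles)
{
    uint32_t raw = elapsed_cycles(start, end);
    return (raw > overhead_cycles) ? (raw - overhead_cycles) : 0u;
}
```

On the target, start and end would be the counter values read immediately before and after the code under test.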
Similarly, you might try using and reg, reg instead of nop, since the Cortex-M4 (STM32F4) may not behave as you expect when using nop instructions.