Sorry, it's a long one, but I'm just explaining my train of thought as I analyze this. Questions at the end.
I have an understanding of what goes into measuring running time.
I would lean toward the last, but I'd consider whether the overhead of starting and stopping a timer could be greater than that of the loop itself.
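That overhead can be measured directly rather than guessed at. Here's a minimal sketch of the two timing structures; I've used Java's `System.nanoTime()` as a stand-in for .NET's `Stopwatch` (an assumption on my part — the structure, not the API, is the point):

```java
public class TimerOverhead {
    // Start/stop the timer inside the loop, once per iteration.
    static long perIterationTiming(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            // the work under test would go here
            total += System.nanoTime() - t0;
        }
        return total;
    }

    // One start/stop around the whole loop.
    static long wholeLoopTiming(int iterations) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // the work under test would go here
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("per-iteration total: " + perIterationTiming(n) + " ns");
        System.out.println("whole-loop total:    " + wholeLoopTiming(n) + " ns");
    }
}
```

If the per-iteration figure dwarfs the whole-loop figure on an empty body, the timer overhead is swamping what you're trying to measure.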
One thing to consider, though, is whether the effect of CPU cache misses is actually a fair thing to try to counter.
Taking advantage of CPU caches is an area where one approach may beat another, but in real-world use there may be a cache miss on each call, making that advantage inconsequential. In that case the approach that makes poorer use of the cache can turn out to have the better real-world performance.
An array-based versus a singly-linked-list-based queue is an example: the former almost always performs better when the cache lines don't get refilled between calls, but it suffers more than the latter on resize operations. Hence the linked-list version can win in real-world use (all the more so because it is easier to write in a lock-free form), even though it will almost always lose in the rapid iterations of a timing test.
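To illustrate the kind of tight-loop test that flatters the array-based design, here's a Java sketch (an assumption on my part: `ArrayDeque` stands in for the array-backed queue and `LinkedList` for the node-backed one):

```java
import java.util.ArrayDeque;
import java.util.LinkedList;
import java.util.Queue;

public class QueueBench {
    // Push n items through the queue and return the sum dequeued,
    // so the work can't be optimised away entirely.
    static long pump(Queue<Integer> q, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) q.add(i);
        for (int i = 0; i < n; i++) sum += q.remove();
        return sum;
    }

    static long timeNs(Queue<Integer> q, int n) {
        long t0 = System.nanoTime();
        pump(q, n);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // In a hot loop like this the array-backed queue usually wins,
        // but that says little about a workload where each call starts cold.
        System.out.println("ArrayDeque: " + timeNs(new ArrayDeque<>(), n) + " ns");
        System.out.println("LinkedList: " + timeNs(new LinkedList<>(), n) + " ns");
    }
}
```

Whichever one wins here, remember the result only describes the warm-cache, rapid-iteration case.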
For this reason it can also be worth running some iterations with something that actually forces the cache to be flushed. I can't think what the best way to do that would be right now, so I might come back and add to this if I do.
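One crude option (my own suggestion, not a proven technique) is to walk a buffer larger than the last-level cache between timed runs, so the data under test is likely to have been evicted. A Java sketch:

```java
public class CacheScrubber {
    // Touch every 64-byte cache line of the buffer. Returning a checksum
    // keeps the JIT from eliminating the loop as dead code.
    static long scrub(byte[] buf) {
        long sum = 0;
        for (int i = 0; i < buf.length; i += 64) {
            buf[i]++;
            sum += buf[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 64 MB is a guess at "bigger than the last-level cache";
        // adjust for the machine you're testing on.
        byte[] junk = new byte[64 * 1024 * 1024];
        // Call scrub(junk) between timed iterations so each iteration
        // starts with a (mostly) cold cache.
        System.out.println("scrub checksum: " + scrub(junk));
    }
}
```

This is only a heuristic: prefetchers and cache replacement policies mean you can't guarantee a fully cold cache from user code.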
I think your first code sample is the best approach.
It is small, clean, and simple, and doesn't use any major abstractions in the test loop that might introduce hidden overhead.
Using the Stopwatch class is a good thing, as it simplifies the code you would normally have to write to get high-resolution timings.
One thing you might consider is providing the option to run the test untimed a small number of times before entering the timing loop, to warm up any caches, buffers, connections, handles, sockets, thread-pool threads, etc. that the test routine may exercise.
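The shape of that warm-up is simple. A hedged sketch in Java (`System.nanoTime()` standing in for `Stopwatch`; the `warmup`/`timed` parameter names are mine):

```java
public class WarmedBenchmark {
    static double sink; // accumulator so the JIT can't discard the work

    // Run `work` untimed `warmup` times to warm caches, JIT, pools, etc.,
    // then time it over `timed` iterations and return average ns per call.
    static double run(Runnable work, int warmup, int timed) {
        for (int i = 0; i < warmup; i++) work.run();
        long t0 = System.nanoTime();
        for (int i = 0; i < timed; i++) work.run();
        return (System.nanoTime() - t0) / (double) timed;
    }

    public static void main(String[] args) {
        double avg = run(() -> sink += Math.sqrt(12345.0), 10_000, 1_000_000);
        System.out.println("avg ns/call: " + avg);
    }
}
```

Reporting the warmed and unwarmed figures side by side can itself be informative: a large gap tells you the routine is sensitive to exactly the cold-start effects discussed above.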
HTH.