Is my method of measuring running time flawed?

春和景丽 · 2021-02-05 12:14

Sorry, it's a long one, but I'm just explaining my train of thought as I analyze this. Questions at the end.

I have an understanding of what goes into measuring runni…

8 Answers
  • 2021-02-05 12:16

    Depending on what the running time of the code you're testing is, it's quite difficult to measure the individual runs. If the runtime of the code you're testing is multiple seconds, your approach of timing each specific run will most likely not be a problem. If it's in the vicinity of milliseconds, your results will probably vary too much. If you e.g. have a context switch or a read from the swap file at the wrong moment, the runtime of that run will be disproportionate to the average runtime.

  • 2021-02-05 12:19

    I tend to agree with @Sam Saffron about using one Stopwatch rather than one per iteration. In your example you're performing 1000000 iterations by default. I don't know what the cost of creating a single Stopwatch is, but you are creating 1000000 of them. Conceivably, that in and of itself could affect your test results. I reworked your "final implementation" a little bit to allow the measurement of each iteration without creating 1000000 Stopwatches. Granted, since I am saving the result of each iteration, I am allocating 1000000 longs, but at first glance it seems like that would have less overall effect than allocating that many Stopwatches. I haven't compared my version to your version to see if mine would yield different results.

    // requires: using System; using System.Diagnostics; using System.Linq;
    static void Test2<T>(string testName, Func<T> test, int iterations = 1000000)
    {
      long [] results = new long [iterations];
    
      Console.WriteLine(testName); // print header
      for (int i = 0; i < 100; i++) // warm up the cache 
      {
        test();
      }
    
      var timer = System.Diagnostics.Stopwatch.StartNew(); // time whole process 
    
      long start;
    
      for (int i = 0; i < results.Length; i++)
      {
        start = Stopwatch.GetTimestamp();
        test();
        results[i] = Stopwatch.GetTimestamp() - start;
      }
    
      timer.Stop();
    
      double ticksPerMillisecond = Stopwatch.Frequency / 1000.0;
    
      Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t / ticksPerMillisecond), results.Average(t => t / ticksPerMillisecond), results.Max(t => t / ticksPerMillisecond), results.Sum(t => t / ticksPerMillisecond));
      Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", results.Min(), results.Average(), results.Max(), results.Sum());
    
      Console.WriteLine();
    }
    

    I am using the Stopwatch's static GetTimestamp method twice in each iteration. The delta between the two calls is the amount of time spent in that iteration. Using Stopwatch.Frequency, we can convert the delta values to milliseconds.

    Using Timestamp and Frequency to calculate performance is not necessarily as clear as just using a Stopwatch instance directly. But, using a different stopwatch for each iteration is probably not as clear as using a single stopwatch to measure the whole thing.

    I don't know that my idea is any better or any worse than yours, but it is slightly different ;-)

    I also agree about the warmup loop. Depending on what your test is doing, there could be some fixed startup costs that you don't want to affect the overall results; the warmup loop should eliminate those.

    There is probably a point at which keeping each individual timing result is counterproductive, due to the cost of the storage needed to hold the whole array of values (or timers). For less memory, but more processing time, you could simply sum the deltas, computing the min and max as you go. That has the potential of throwing off your results, but if you are primarily concerned with the statistics generated from the individual iteration measurements, then you can just do the min and max calculation outside of the time delta check:

    static void Test2<T>(string testName, Func<T> test, int iterations = 1000000)
    {
      //long [] results = new long [iterations];
      long min = long.MaxValue;
      long max = long.MinValue;
    
      Console.WriteLine(testName); // print header
      for (int i = 0; i < 100; i++) // warm up the cache 
      {
        test();
      }
    
      var timer = System.Diagnostics.Stopwatch.StartNew(); // time whole process 
    
      long start;
      long delta;
      long sum = 0;
    
      for (int i = 0; i < iterations; i++)
      {
        start = Stopwatch.GetTimestamp();
        test();
        delta = Stopwatch.GetTimestamp() - start;
        if (delta < min) min = delta;
        if (delta > max) max = delta;
        sum += delta;
      }
    
      timer.Stop();
    
      double ticksPerMillisecond = Stopwatch.Frequency / 1000.0;
    
      Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", min / ticksPerMillisecond, sum / ticksPerMillisecond / iterations, max / ticksPerMillisecond, sum);
      Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", min, sum / iterations, max, sum);
    
      Console.WriteLine();
    }
    

    Looks pretty old school without the Linq operations, but it still gets the job done.

  • 2021-02-05 12:20

    Regardless of the mechanism for timing your function (and the answers here seem fine), there is a very simple trick to eradicate the overhead of the benchmarking code itself, i.e. the overhead of the loop, the timer readings, and the method call:

    Simply call your benchmarking code with an empty Func<T> first, i.e.

    static T EmptyFunc<T>() { return default(T); }
    

    This will give you a baseline of the timing-overhead, which you can essentially subtract from the latter measurements of your actual benchmarked function.

    By "essentially" I mean that there are always room for variations when timing some code, due to garbage collection and thread and process scheduling. A pragmatic approach would e.g. be to benchmark the empty function, find the average overhead (total time divided by iterations) and then subtract that number from each timing-result of the real benchmarked function, but don't let it go below 0 which wouldn't make sense.

    You will, of course, have to rearrange your benchmarking code a bit. Ideally you'll want to use the exact same code to benchmark the empty function and the real benchmarked function, so I suggest you move the timing loop into another function, or at least keep the two loops completely alike. In summary:

    1. benchmark the empty function
    2. calculate the average overhead from the result
    3. benchmark the real test-function
    4. subtract the average overhead from those test results
    5. you're done

    By doing this the actual timing mechanism suddenly becomes a lot less important.
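
    A minimal sketch of steps 1-4, assuming the same Func<T> shape used elsewhere on this page (the helper names MeasureAverageTicks and BenchmarkWithBaseline are illustrative, not from the answer above):

    static double MeasureAverageTicks<T>(Func<T> test, int iterations)
    {
      var timer = System.Diagnostics.Stopwatch.StartNew();
      for (int i = 0; i < iterations; i++)
      {
        test();
      }
      timer.Stop();
      return (double)timer.ElapsedTicks / iterations; // average Stopwatch ticks per call
    }

    static void BenchmarkWithBaseline<T>(Func<T> test, int iterations = 1000000)
    {
      // steps 1-2: time the empty function to estimate the loop + delegate-call overhead
      double overhead = MeasureAverageTicks(EmptyFunc<T>, iterations);

      // step 3: time the real function with the exact same loop
      double measured = MeasureAverageTicks(test, iterations);

      // step 4: subtract the overhead, but never let the result go below zero
      double perCall = Math.Max(0.0, measured - overhead);
      Console.WriteLine("~{0:F1} ticks per call after subtracting overhead", perCall);
    }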

  • 2021-02-05 12:20

    I had a similar question here.

    I much prefer the concept of using a single stopwatch, especially if you are micro-benchmarking. Your code is not accounting for the GC, which can affect performance.

    I think forcing a GC collection prior to the test runs is pretty important; also, I am not sure what the point of the 100 warmup runs is.
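
    For instance, a common pattern for forcing a collection right before the timed region looks roughly like this (a sketch, not code from the question or this answer):

    // Illustrative only: force a full collection and wait for finalizers so a
    // pending GC is less likely to fire inside the timed region.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    var timer = System.Diagnostics.Stopwatch.StartNew();
    // ... run the iterations being measured ...
    timer.Stop();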

  • 2021-02-05 12:25

    The logic in Approach 2 feels 'righter' to me, but I'm just a CS student.

    I came across this link that you might find of interest: http://www.yoda.arachsys.com/csharp/benchmark.html

  • 2021-02-05 12:36

    My first thought is that a loop as simple as

    for (int i = 0; i < x; i++)
    {
        timer.Start();
        test();
        timer.Stop();
    }
    

    is kinda silly compared to:

    timer.Start();
    for (int i = 0; i < x; i++)
        test();
    timer.Stop();
    

    The reason is that (1) this kind of "for" loop has a very tiny overhead, so small that it's hardly worth worrying about even if test() takes only a microsecond, and (2) timer.Start() and timer.Stop() have their own overhead, which is likely to affect the results more than the for loop does. That said, I took a peek at Stopwatch in Reflector and noticed that Start() and Stop() are fairly cheap (calling the Elapsed* properties is likely more expensive, considering the math involved).

    Make sure the IsHighResolution property of Stopwatch is true. If it's false, Stopwatch uses DateTime.UtcNow, which I believe is only updated every 15-16 ms.
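
    A quick way to check, as a small illustrative snippet (not part of the original answer):

    // Print which timer Stopwatch is backed by and its resolution.
    Console.WriteLine("IsHighResolution: {0}", System.Diagnostics.Stopwatch.IsHighResolution);
    Console.WriteLine("Frequency: {0} ticks per second", System.Diagnostics.Stopwatch.Frequency);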

    1. Is getting the running time of each individual iteration generally a good thing to have?

    It is not usually necessary to measure the runtime of each individual iteration, but it is useful to find out how much the performance varies between different iterations. To this end, you can compute the min/max (or k outliers) and standard deviation. Only the "median" statistic requires you to record every iteration.

    If you find that the standard deviation is large, you might then have reason to record every iteration, in order to explore why the time keeps changing.

    Some people have written small frameworks to help you do performance benchmarks. For example, CodeTimers. If you are testing something that is so tiny and simple that the overhead of the benchmark library matters, consider running the operation in a for-loop inside the lambda that the benchmark library calls. If the operation is so tiny that the overhead of a for-loop matters (e.g. measuring the speed of multiplication), then use manual loop unrolling. But if you use loop unrolling, remember that most real-world apps don't use manual loop unrolling, so your benchmark results may overstate the real-world performance.
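
    As a sketch of the "for-loop inside the lambda" idea, assuming a hypothetical RunBenchmark(string, Func<double>) helper that stands in for whatever benchmark library you use:

    const int InnerIterations = 1000;
    RunBenchmark("multiply", () =>
    {
      // Amortize the harness overhead over many operations; divide the
      // reported time by InnerIterations to get the per-operation cost.
      double x = 1.0;
      for (int i = 0; i < InnerIterations; i++)
      {
        x *= 1.0000001;   // the tiny operation under test
      }
      return x;           // return the result so the loop is not optimized away
    });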

    For myself I wrote a little class for gathering min, max, mean, and standard deviation, which could be used for benchmarks or other statistics:

    // requires: using System; using System.Diagnostics;
    // A lightweight class to help you compute the minimum, maximum, average
    // and standard deviation of a set of values. Call Clear(), then Add(each
    // value); you can compute the average and standard deviation at any time by 
    // calling Avg() and StdDeviation().
    class Statistic
    {
        public double Min;
        public double Max;
        public double Count;
        public double SumTotal;
        public double SumOfSquares;
    
        public void Clear()
        {
            SumOfSquares = Min = Max = Count = SumTotal = 0;
        }
        public void Add(double nextValue)
        {
            Debug.Assert(!double.IsNaN(nextValue));
            if (Count > 0)
            {
                if (Min > nextValue)
                    Min = nextValue;
                if (Max < nextValue)
                    Max = nextValue;
                SumTotal += nextValue;
                SumOfSquares += nextValue * nextValue;
                Count++;
            }
            else
            {
                Min = Max = SumTotal = nextValue;
                SumOfSquares = nextValue * nextValue;
                Count = 1;
            }
        }
        public double Avg()
        {
            return SumTotal / Count;
        }
        public double Variance()
        {
            return (SumOfSquares * Count - SumTotal * SumTotal) / (Count * (Count - 1));
        }
        public double StdDeviation()
        {
            return Math.Sqrt(Variance());
        }
        public Statistic Clone()
        {
            return (Statistic)MemberwiseClone();
        }
    };
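
    One possible way to combine it with the GetTimestamp approach shown earlier on this page (a sketch; `test` and `iterations` are assumed to come from the surrounding benchmark method):

    // Feed per-iteration tick deltas into Statistic, then report the spread.
    var stats = new Statistic();
    stats.Clear();
    for (int i = 0; i < iterations; i++)
    {
      long start = Stopwatch.GetTimestamp();
      test();
      stats.Add(Stopwatch.GetTimestamp() - start);
    }

    double ticksPerMs = Stopwatch.Frequency / 1000.0;
    Console.WriteLine("avg {0:F4} ms  stddev {1:F4} ms  min {2:F4} ms  max {3:F4} ms",
      stats.Avg() / ticksPerMs, stats.StdDeviation() / ticksPerMs,
      stats.Min / ticksPerMs, stats.Max / ticksPerMs);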
    

    2. Is having a small loop of runs before the actual timing starts good too?

    Which iterations you measure depends on whether you care most about startup time, steady-state time or total runtime. In general, it may be useful to record one or more runs separately as "startup" runs. You can expect the first iteration (and sometimes more than one) to run more slowly. As an extreme example, my GoInterfaces library consistently takes about 140 milliseconds to produce its first output, then it does 9 more in about 15 ms.

    Depending on what the benchmark measures, you may find that if you run the benchmark right after rebooting, the first iteration (or first few iterations) will run very slowly. Then, if you run the benchmark a second time, the first iteration will be faster.

    3. Would a forced Thread.Yield() within the loop help or hurt the timings of CPU bound test cases?

    I'm not sure. It may clear the processor caches (L1, L2, TLB), which would not only slow down your benchmark overall but also lower the measured speeds. Your results would be more "artificial", not reflecting as well what you would get in the real world. Perhaps a better approach is to avoid running other tasks at the same time as your benchmark.
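
    One commonly used mitigation along those lines, not mentioned in the answer above and shown here only as an assumption-laden sketch, is to pin the benchmark process to a single core and raise its priority so the scheduler interrupts it less often:

    // Illustrative only: pin to one core and raise priorities. This changes
    // the environment being measured, so note it when reporting results.
    var proc = System.Diagnostics.Process.GetCurrentProcess();
    proc.ProcessorAffinity = (System.IntPtr)1;  // run on CPU 0 only
    proc.PriorityClass = System.Diagnostics.ProcessPriorityClass.High;
    System.Threading.Thread.CurrentThread.Priority = System.Threading.ThreadPriority.Highest;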
