What's the best timing resolution I can get on Linux?


I'm trying to measure the time difference between 2 signals on the parallel port, but first I need to know how accurate and precise my measuring system is (AMD Athlon(t


3 answers

暖寄归人

2020-12-20 10:24
I realise this topic is long dead, but wanted to throw in my findings. This is a long answer so I have put the short answer here and those with the patience can wade through the rest. The not-quite-the-answer to the question is 700 ns or 1500 ns depending on which mode of clock_gettime() you used. The long answer is way more complicated.

For reference, the machine I did this work on is an old laptop that nobody wanted. It is an Acer Aspire 5720Z running Ubuntu 14.04.1 LTS.

The hardware:
RAM: 2.0 GiB // This is how Ubuntu reports it in 'System Settings' → 'Details'
Processor: Intel® Pentium(R) Dual CPU T2330 @ 1.60GHz × 2
Graphics: Intel® 965GM x86/MMX/SSE2

I wanted to measure time accurately in an upcoming project and, as a relative newcomer to PC hardware regardless of operating system, I thought I would do some experimentation on the resolution of the timing hardware. I stumbled across this question.

Because of this question, I decided that clock_gettime() looks like it meets my needs. But my experience with PC hardware in the past has left me underwhelmed, so I started fresh with some experiments to see what the actual resolution of the timer is.

The method: Collect successive samples of the result from clock_gettime() and look for any patterns in the resolution. Code follows.

Results in a slightly longer Summary:
1. Not really a result. The stated resolution of the fields in the structure is in nanoseconds. The result of a call to clock_getres() is also tv_sec 0, tv_nsec 1. But previous experience has taught me not to trust the resolution from a structure alone. It is an upper limit on precision, and reality tends to be a whole lot more complex.
2. The actual resolution of the clock_gettime() result on my machine, with my program, with my operating system, on one particular day etc. turns out to be 70 nanoseconds for modes 0 and 1. 70 ns is not too bad but unfortunately this is not realistic, as we will see in the next point. To complicate matters, the resolution appears to be 7 ns when using modes 2 and 3.
3. The duration of the clock_gettime() call is more like 1500 ns for modes 0 and 1. It doesn't make sense to me at all to claim 70 ns resolution on the time if it takes about 20 times the resolution to get a value.
4. Some modes of clock_gettime() are faster than others. Modes 0 and 1 are statistically indistinguishable from each other. Modes 2 and 3 take about half the wall-clock time of modes 0 and 1, with mode 3 being the fastest overall.

Before continuing, I had better define which mode is which:
Mode 0 CLOCK_REALTIME // reference: http://linux.die.net/man/3/clock_gettime
Mode 1 CLOCK_MONOTONIC
Mode 2 CLOCK_PROCESS_CPUTIME_ID
Mode 3 CLOCK_THREAD_CPUTIME_ID
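
To repeat the clock_getres() check from point 1 for each of the four modes, something like the following minimal sketch could be used. It is not part of the original program; the helper name `reported_resolution_ns` is made up for illustration, and it only reads back what the kernel reports, which (as noted above) says nothing about the resolution you actually observe.

```cpp
#include <time.h>   // clock_getres(), clockid_t, timespec, CLOCK_* constants

// Resolution the kernel *reports* for a clock, in nanoseconds,
// or -1 if clock_getres() fails (e.g. the clock is not supported).
// Hypothetical helper, for illustration only.
long long reported_resolution_ns(clockid_t clock)
{
  timespec res;
  if (clock_getres(clock, &res) != 0)
    return -1;
  return static_cast<long long>(res.tv_sec) * 1000000000LL + res.tv_nsec;
}
```

Calling it with CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID in turn covers modes 0 to 3.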

Conclusion: To me it doesn't make sense to talk about the resolution of the time intervals if the resolution is smaller than the length of time the function takes to get the time interval. For example, if we use mode 3, we know that the function completes within 700 nanoseconds 99% of the time. And we further know that the time interval we get back will be a multiple of 7 nanoseconds. So the 'resolution' of 7 nanoseconds, is 1/100th of the time to do the call to get the time. I don't see any value in the 7 nanosecond change interval. There are 3 different answers to the question of resolution: 1 ns, 7 or 70 ns, and finally 700 or 1500 ns. I favour the last figure.

After all is said and done, if you want to measure the performance of some operation, you need to keep in mind how long the clock_gettime() call takes: 700 or 1500 ns. There is no point trying to measure something that takes 7 nanoseconds, for example. For the sake of argument, let's say you were willing to live with 1% error on your performance test conclusions. If using mode 3 (which I think I will be using in my project), the interval you are measuring needs to be at least 100 times 700 nanoseconds, or 70 microseconds. Otherwise your conclusions will have more than 1% error. So go ahead and measure your code of interest, but if the elapsed time in the code of interest is less than 70 microseconds, then you had better loop through the code of interest enough times that the interval is 70 microseconds or more.
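
The arithmetic in that rule of thumb can be written down as a small sketch. The function names and the integer percentage parameter are my own choices for illustration; the 700 ns and 1% figures are the ones from the paragraph above.

```cpp
// Shortest interval (in ns) worth timing when the clock call itself costs
// overhead_ns and at most max_error_percent of error is acceptable.
// Hypothetical helper; integer arithmetic keeps the result exact.
long long min_interval_ns(long long overhead_ns, long long max_error_percent)
{
  return overhead_ns * 100 / max_error_percent;
}

// Number of times to loop an operation that takes roughly op_ns so that
// the timed region reaches that minimum interval (rounded up).
long long repetitions_needed(long long op_ns, long long overhead_ns,
                             long long max_error_percent)
{
  long long target = min_interval_ns(overhead_ns, max_error_percent);
  return (target + op_ns - 1) / op_ns;
}
```

For mode 3, `min_interval_ns(700, 1)` gives 70000 ns (70 microseconds), and an operation of about 7 ns would need `repetitions_needed(7, 700, 1)`, i.e. 10000 iterations, inside the timed region.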

Justification for these claims and some details:

Claim 3 first. This is simple enough. Just run clock_gettime() a large number of times and record the results in an array, then process the results. Do the processing outside the loop so that the time between clock_gettime() calls is as short as possible.
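
As a rough illustration of that procedure (separate from the Exp03 program listed at the end of this answer), the sketch below collects the samples first and only then computes call-to-call differences and percentiles. The helper names are invented for this example, the differences are kept as integer nanoseconds, and error handling is left out.

```cpp
#include <time.h>      // clock_gettime(), timespec
#include <algorithm>   // std::sort
#include <cstddef>     // std::size_t
#include <vector>

// Difference b - a in whole nanoseconds (integer arithmetic only).
long long diff_ns(const timespec &a, const timespec &b)
{
  return static_cast<long long>(b.tv_sec - a.tv_sec) * 1000000000LL
         + (b.tv_nsec - a.tv_nsec);
}

// Call clock_gettime() n times back to back, then return the n-1
// call-to-call differences. All processing happens after the loop.
std::vector<long long> collect_deltas_ns(clockid_t clock, std::size_t n)
{
  std::vector<timespec> samples(n);
  for (std::size_t i = 0; i < n; ++i)
    clock_gettime(clock, &samples[i]);   // return value ignored in this sketch

  std::vector<long long> deltas;
  for (std::size_t i = 0; i + 1 < n; ++i)
    deltas.push_back(diff_ns(samples[i], samples[i + 1]));
  return deltas;
}

// Nearest-rank percentile (p in 0..100) of a non-empty vector.
long long percentile(std::vector<long long> values, unsigned p)
{
  std::sort(values.begin(), values.end());
  std::size_t rank = (p * values.size() + 99) / 100;   // ceil(p/100 * N)
  if (rank == 0) rank = 1;
  return values[rank - 1];
}
```

With this, `percentile(collect_deltas_ns(CLOCK_MONOTONIC, 100001), 99)` would give the 99th-percentile duration of one call for mode 1; the figures quoted in this answer came from the author's own tooling, not from this sketch.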

What does all that mean? See the graph attached. For mode 0 for example, the call to clock_gettime() takes less than 1.5 microseconds most of the time. You can see that mode 0 and mode 1 are basically the same. However, modes 2 and 3 are very different to modes 0 and 1, and slightly different to each other. Modes 2 and 3 take about half the wall-clock time for clock_gettime() compared to modes 0 and 1. Also note that mode 0 and 1 are slightly different to each other – unlike modes 2 and 3. Note that mode 0 and 1 differ by 70 nanoseconds – which is a number which we will come back to in claim #2.

The attached graph is range-limited to 2 microseconds; otherwise the outliers in the data prevent the graph from conveying the previous point. Something the graph therefore doesn't make clear is that the outliers for modes 0 and 1 are much worse than the outliers for modes 2 and 3. In other words, not only are the average, the statistical 'mode' (the value which occurs the most) and the median (i.e. the 50th percentile) different for all these modes, so are their maximum values and their 99th percentiles.

The graph attached is for 100,001 samples for each of the four modes. Please note that the tests graphed were using a CPU mask of processor 0 only. Whether I used CPU affinity or not didn't seem to make any difference to the graph.

Claim 2: If you look closely at the samples collected when preparing the graph, you soon notice that the difference between the differences (i.e. the 2nd order differences) is relatively constant, at around 70 nanoseconds (for modes 0 and 1 at least). To repeat this experiment, collect 'n' samples of clock time as before. Then calculate the differences between each sample. Now sort the differences into order (e.g. sort -g) and then derive the individual unique differences (e.g. uniq -c).

For example:
```
$ ./Exp03 -l 1001 -m 0 -k | sort -g | awk -f mergeTime2.awk | awk -f percentages.awk | sort -g
1.118e-06 8 8 0.8 0.8       // time,count,cumulative count, count%, cumulative count%
1.188e-06 17 25 1.7 2.5
1.257e-06 9 34 0.9 3.4
1.327e-06 570 604 57 60.4
1.397e-06 301 905 30.1 90.5
1.467e-06 53 958 5.3 95.8
1.537e-06 26 984 2.6 98.4
<snip>
```
The difference between the durations in the first column is often 7e-8 or 70 nanoseconds. This can become more clear by processing the differences:
```
$ <as above> | awk -f differences.awk 
7e-08
6.9e-08
7e-08
7e-08
7e-08
7e-08
6.9e-08
7e-08
2.1e-07 // 3 lots of 7e-08
<snip>
```
Notice how all the differences are integer multiples of 70 nanoseconds? Or at least within rounding error of 70 nanoseconds.
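
For readers who would rather stay in C++ than pipe through sort, uniq and awk, the same two steps (count the distinct durations, then take the gaps between neighbouring distinct values) might look like the sketch below. The function names are hypothetical and the durations are assumed to be integer nanoseconds rather than the floating-point seconds printed above.

```cpp
#include <map>
#include <vector>

// Count how often each distinct duration occurs (like `sort -g | uniq -c`).
// std::map keeps the keys in ascending order.
std::map<long long, int> histogram(const std::vector<long long> &deltas_ns)
{
  std::map<long long, int> counts;
  for (long long d : deltas_ns)
    ++counts[d];
  return counts;
}

// Gaps between consecutive distinct durations (the job differences.awk does).
std::vector<long long> gaps(const std::map<long long, int> &counts)
{
  std::vector<long long> out;
  bool first = true;
  long long previous = 0;
  for (const auto &entry : counts) {
    if (!first)
      out.push_back(entry.first - previous);
    previous = entry.first;
    first = false;
  }
  return out;
}
```

Fed the first four durations from the table above as integers (1118, 1188, 1257, 1327), `gaps(histogram(...))` yields 70, 69, 70.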

This result may well be hardware dependent, but I don't actually know what limits this to 70 nanoseconds at this time. Perhaps there is a 14.28 MHz oscillator somewhere?

Please note that in practice I use a much larger number of samples, such as 100,000, not 1000 as above.

Relevant code (attached):

'Exp03' is the program which calls clock_gettime() as fast as possible. Note that typical usage would be something like:

./Exp03 -l 100001 -m 3

This would call clock_gettime() 100,001 times so that we can compute 100,000 differences. Each call to clock_gettime() in this example would be using mode 3.

mergeTime2.awk is a useful script which is a glorified 'uniq' command. The issue is that the 2nd order differences are often in pairs of 69 and 1 nanosecond, not 70 (for modes 0 and 1 at least) as I have led you to believe so far. Because there is no 68 nanosecond difference or 2 nanosecond difference, I have merged these 69 and 1 nanosecond pairs into one number of 70 nanoseconds. Why the 69/1 behaviour occurs at all is interesting, but treating these as two separate numbers mostly added 'noise' to the analysis.

Before you ask, I have repeated this exercise avoiding floating point, and the same problem still occurs. The resulting tv_nsec as an integer has this 69/1 behaviour (or 1/7 and 1/6) so please don't assume that this is an artefact caused by floating point subtraction.

Please note that I am confident with this 'simplification' for 70 ns and for small integer multiples of 70 ns, but this approach looks less robust for the 7 ns case especially when you get 2nd order differences of 10 times the 7 ns resolution.
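
The merging step can also be expressed in C++ on integer nanoseconds, which sidesteps the floating-point comparison in the awk script. This is only a sketch modelled on the logic of mergeTime2.awk (bins closer together than the threshold are folded into the later bin); the function name and the default threshold of 2 ns are assumptions made for this example.

```cpp
#include <map>

// Merge neighbouring bins that are less than merge_below_ns apart,
// keeping the larger key, in the spirit of mergeTime2.awk's 2e-9 test.
// Input: duration in ns -> count, ascending by key (std::map guarantees this).
std::map<long long, int> merge_close_bins(const std::map<long long, int> &counts,
                                          long long merge_below_ns = 2)
{
  std::map<long long, int> merged;
  bool first = true;
  long long last_key = 0;
  int last_count = 0;
  for (const auto &entry : counts) {
    if (first) {
      first = false;
      last_key = entry.first;
      last_count = entry.second;
    } else if (entry.first - last_key < merge_below_ns) {
      last_key = entry.first;          // fold the earlier bin into this one
      last_count += entry.second;
    } else {
      merged[last_key] = last_count;   // emit the finished bin
      last_key = entry.first;
      last_count = entry.second;
    }
  }
  if (!first)
    merged[last_key] = last_count;     // emit the final bin
  return merged;
}
```

For example, bins at 1118, 1187, 1188 and 1257 ns collapse to 1118, 1188 and 1257 ns, with the counts of 1187 and 1188 added together, so the remaining bins are spaced roughly 70 ns apart.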

percentages.awk and differences.awk are attached as well, in case they are useful.

Stop press: I can't post the graph as I don't have a 'reputation of at least 10'. Sorry 'bout that.

Rob Watson 21 Nov 2014

Exp03.cpp
```
/* Like Exp02.cpp except that here I am experimenting with
   modes other than CLOCK_REALTIME
   RW 20 Nov 2014
*/

/* Added CPU affinity to see if that had any bearing on the results
   RW 21 Nov 2014
*/

#include <iostream>
using namespace std;
#include <iomanip>

#include <stdlib.h> // getopts needs both of these
#include <unistd.h>

#include <errno.h> // errno

#include <string.h> // strerror()

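#include <assert.h>

#include <sched.h> // sched_setaffinity(), cpu_set_t, CPU_* macros (g++ defines _GNU_SOURCE by default)
#include <time.h>  // clock_gettime(), clock_getcpuclockid(), timespec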

// #define MODE CLOCK_REALTIME
// #define MODE CLOCK_MONOTONIC
// #define MODE CLOCK_PROCESS_CPUTIME_ID
// #define MODE CLOCK_THREAD_CPUTIME_ID

int main(int argc, char ** argv)
{
  int NumberOf = 1000;
  int Mode = 0;
  int Verbose = 0;
  int c;
  // l loops, m mode, h help, v verbose, k masK


  int rc;
  cpu_set_t mask;
  int doMaskOperation = 0;

  while ((c = getopt (argc, argv, "l:m:hkv")) != -1)
  {
    switch (c)
      {
      case 'l': // ell not one
        NumberOf = atoi(optarg);
        break;
      case 'm':
        Mode = atoi(optarg);
        break;
      case 'h':
        cout << "Usage: <command> -l <int> -m <mode>" << endl
             << "where -l represents the number of loops and "
             << "-m represents the mode 0..3 inclusive" << endl
             << "0 is CLOCK_REALTIME" << endl
             << "1 CLOCK_MONOTONIC" <<  endl
             << "2 CLOCK_PROCESS_CPUTIME_ID" << endl
             << "3 CLOCK_THREAD_CPUTIME_ID" << endl;
        break;
      case 'v':
        Verbose = 1;
        break;
      case 'k': // masK - sorry! Already using 'm'...
        doMaskOperation = 1;
        break;
      case '?':
        cerr << "XXX unimplemented! Sorry..." << endl;
        break;
      default:
        abort();
      }
  }

  if (doMaskOperation)
  {
    if (Verbose)
    {
      cout << "Setting CPU mask to CPU 0 only!" << endl;
    }
    CPU_ZERO(&mask);
    CPU_SET(0,&mask);
    rc = sched_setaffinity(0,sizeof(mask),&mask);
    assert(rc == 0); // keep the call outside assert() so it still runs with -DNDEBUG
  }

  if (Verbose) {
    cout << "Verbose: Mode in use: " << Mode << endl;
  }

  if (Verbose)
  {
    rc = sched_getaffinity(0,sizeof(mask),&mask);
    // cout << "getaffinity rc is " << rc << endl;
    // cout << "getaffinity mask is " << mask << endl;
    int numOfCPUs = CPU_COUNT(&mask);
    cout << "Number of CPU's is " << numOfCPUs << endl;
    for (int i=0;i<CPU_SETSIZE;++i) // CPU_SETSIZE is the number of CPUs a cpu_set_t can describe; sizeof(mask) is only its size in bytes (128)
    {
      if (CPU_ISSET(i,&mask))
      {
        cout << "CPU " << i << " is set" << endl;
      }
      //cout << "CPU " << i 
      //     << " is " << (CPU_ISSET(i,&mask) ? "set " : "not set ") << endl;
    }
  }

  clockid_t cpuClockID;
  int err = clock_getcpuclockid(0,&cpuClockID);
  if (Verbose)
  {
    cout << "Verbose: clock_getcpuclockid(0) returned err " << err << endl;
    cout << "Verbose: clock_getcpuclockid(0) returned cpuClockID " 
       << cpuClockID << endl;
  }

  timespec timeNumber[NumberOf];
  for (int i=0;i<NumberOf;++i)
  {
    err = clock_gettime(Mode, &timeNumber[i]);
    if (err != 0) {
      int errSave = errno;
      cerr << "errno is " << errSave 
           << " NumberOf is " << NumberOf << endl;
      cerr << strerror(errSave) << endl;
      cerr << "Aborting due to this error" << endl;
      abort();
    }
  }

  for (int i=0;i<NumberOf-1;++i)
  {
    cout << timeNumber[i+1].tv_sec - timeNumber[i].tv_sec
            + (timeNumber[i+1].tv_nsec - timeNumber[i].tv_nsec) / 1000000000.
         << endl;
    
  }
  return 0;
}
```
mergeTime2.awk
```
BEGIN {
 PROCINFO["sorted_in"] = "@ind_num_asc"
}

{array[$0]++}

END {
  lastX = -1;
  first = 1;
  
  for (x in array)
  {
    if (first) { 
      first = 0 
      lastX = x; lastCount = array[x]; 
    } else {
      delta = x - lastX;
      if (delta < 2e-9) { # this is nasty floating point stuff!!
        lastCount += array[x]; 
        lastX = x
      } else {
        Cumulative += lastCount;
        print lastX "\t" lastCount "\t" Cumulative
        lastX = x; 
        lastCount = array[x]; 
      }
    }
  }
  print lastX "\t" lastCount "\t" Cumulative+lastCount
}
```
percentages.awk
```
{ # input is $1 a time interval $2 an observed frequency (i.e. count)
  # $3 is a cumulative frequency
  b[$1]=$2;
  c[$1]=$3;
  sum=sum+$2
} 

END {
  for (i in b) print i,b[i],c[i],(b[i]/sum)*100, (c[i]*100/sum);
}
```
differences.awk
```
NR==1 {
  old=$1;next
} 
{
  print $1-old;
  old=$1
}
```
轮回少年

2020-12-20 10:26
RDTSCP on your AMD Athlon 64 X2 will give you the time stamp counter with resolution dependent upon your clock. However accuracy is different to resolution, you need to lock thread affinity and disable interrupts (see IRQ routing).

This entails dropping down to assembler or, for Windows developers, using MSVC 2008 intrinsics.

RedHat with RHEL5 introduced user-space shims that replace gettimeofday with high resolution RDTSCP calls:
- ~~http://developer.amd.com/Resources/documentation/articles/Pages/1214200692_5.aspx~~
- https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
Also, check your hardware: an AMD 5200 has a 2.6 GHz clock, which gives an interval of about 0.4 ns, and the cost of gettimeofday with RDTSCP is 221 cycles, which at 0.4 ns per cycle is about 88 ns at best.
野的像风

2020-12-20 10:35

As far as I know, Linux running on a PC will generally not be able to give you timer accuracy in the nanoseconds range. This is mainly due to the type of task/process scheduler used in the kernel. This is as much a result of the kernel as it is of the hardware.

If you need timing with nanosecond resolution I'm afraid that you're out of luck. However you should be able to get micro-second resolution which should be good enough for most scenarios - including your parallel port application.

If you need timing in the nanosecond range to be accurate to the nanosecond, you will most likely need a dedicated hardware solution with a really accurate oscillator (for comparison, the base clock frequency of most x86 CPUs is in the range of megahertz before the multipliers).

Finally, if you're looking to replace the functionality of an oscilloscope with your computer that's just not going to work beyond relatively low frequency signals. You'd be much better off investing in a scope - even a simple, portable, hand-held that plugs into your computer for displaying the data.
