I wish to calculate the time it took for an API to return a value. The time taken for such an action is in the space of nano seconds. As the API is a C++ class/function, I a
If you need subsecond precision, you need to use system-specific extensions, and will have to check with the documentation for the operating system. POSIX supports up to microseconds with gettimeofday, but nothing more precise since computers didn't have frequencies above 1GHz.
If you are using Boost, you can check boost::posix_time.
What do you think about that:
int iceu_system_GetTimeNow(long long int *res)
{
static struct timespec buffer;
//
#ifdef __CYGWIN__
if (clock_gettime(CLOCK_REALTIME, &buffer))
return 1;
#else
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &buffer))
return 1;
#endif
*res=(long long int)buffer.tv_sec * 1000000000LL + (long long int)buffer.tv_nsec;
return 0;
}
You can use Embedded Profiler (free for Windows and Linux) which has an interface to a multiplatform timer (in a processor cycle count) and can give you a number of cycles per seconds:
EProfilerTimer timer;
timer.Start();
... // Your code here
const uint64_t number_of_elapsed_cycles = timer.Stop();
const uint64_t nano_seconds_elapsed =
mumber_of_elapsed_cycles / (double) timer.GetCyclesPerSecond() * 1000000000;
Recalculation of cycle count to time is possibly a dangerous operation with modern processors where CPU frequency can be changed dynamically. Therefore to be sure that converted times are correct, it is necessary to fix processor frequency before profiling.
This new answer uses C++11's <chrono>
facility. While there are other answers that show how to use <chrono>
, none of them shows how to use <chrono>
with the RDTSC
facility mentioned in several of the other answers here. So I thought I would show how to use RDTSC
with <chrono>
. Additionally I'll demonstrate how you can templatize the testing code on the clock so that you can rapidly switch between RDTSC
and your system's built-in clock facilities (which will likely be based on clock()
, clock_gettime()
and/or QueryPerformanceCounter
.
Note that the RDTSC
instruction is x86-specific. QueryPerformanceCounter
is Windows only. And clock_gettime()
is POSIX only. Below I introduce two new clocks: std::chrono::high_resolution_clock
and std::chrono::system_clock
, which, if you can assume C++11, are now cross-platform.
First, here is how you create a C++11-compatible clock out of the Intel rdtsc
assembly instruction. I'll call it x::clock
:
#include <chrono>
namespace x
{
struct clock
{
typedef unsigned long long rep;
typedef std::ratio<1, 2'800'000'000> period; // My machine is 2.8 GHz
typedef std::chrono::duration<rep, period> duration;
typedef std::chrono::time_point<clock> time_point;
static const bool is_steady = true;
static time_point now() noexcept
{
unsigned lo, hi;
asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
return time_point(duration(static_cast<rep>(hi) << 32 | lo));
}
};
} // x
All this clock does is count CPU cycles and store it in an unsigned 64-bit integer. You may need to tweak the assembly language syntax for your compiler. Or your compiler may offer an intrinsic you can use instead (e.g. now() {return __rdtsc();}
).
To build a clock you have to give it the representation (storage type). You must also supply the clock period, which must be a compile time constant, even though your machine may change clock speed in different power modes. And from those you can easily define your clock's "native" time duration and time point in terms of these fundamentals.
If all you want to do is output the number of clock ticks, it doesn't really matter what number you give for the clock period. This constant only comes into play if you want to convert the number of clock ticks into some real-time unit such as nanoseconds. And in that case, the more accurate you are able to supply the clock speed, the more accurate will be the conversion to nanoseconds, (milliseconds, whatever).
Below is example code which shows how to use x::clock
. Actually I've templated the code on the clock as I'd like to show how you can use many different clocks with the exact same syntax. This particular test is showing what the looping overhead is when running what you want to time under a loop:
#include <iostream>
template <class clock>
void
test_empty_loop()
{
// Define real time units
typedef std::chrono::duration<unsigned long long, std::pico> picoseconds;
// or:
// typedef std::chrono::nanoseconds nanoseconds;
// Define double-based unit of clock tick
typedef std::chrono::duration<double, typename clock::period> Cycle;
using std::chrono::duration_cast;
const int N = 100000000;
// Do it
auto t0 = clock::now();
for (int j = 0; j < N; ++j)
asm volatile("");
auto t1 = clock::now();
// Get the clock ticks per iteration
auto ticks_per_iter = Cycle(t1-t0)/N;
std::cout << ticks_per_iter.count() << " clock ticks per iteration\n";
// Convert to real time units
std::cout << duration_cast<picoseconds>(ticks_per_iter).count()
<< "ps per iteration\n";
}
The first thing this code does is create a "real time" unit to display the results in. I've chosen picoseconds, but you can choose any units you like, either integral or floating point based. As an example there is a pre-made std::chrono::nanoseconds
unit I could have used.
As another example I want to print out the average number of clock cycles per iteration as a floating point, so I create another duration, based on double, that has the same units as the clock's tick does (called Cycle
in the code).
The loop is timed with calls to clock::now()
on either side. If you want to name the type returned from this function it is:
typename clock::time_point t0 = clock::now();
(as clearly shown in the x::clock
example, and is also true of the system-supplied clocks).
To get a duration in terms of floating point clock ticks one merely subtracts the two time points, and to get the per iteration value, divide that duration by the number of iterations.
You can get the count in any duration by using the count()
member function. This returns the internal representation. Finally I use std::chrono::duration_cast
to convert the duration Cycle
to the duration picoseconds
and print that out.
To use this code is simple:
int main()
{
std::cout << "\nUsing rdtsc:\n";
test_empty_loop<x::clock>();
std::cout << "\nUsing std::chrono::high_resolution_clock:\n";
test_empty_loop<std::chrono::high_resolution_clock>();
std::cout << "\nUsing std::chrono::system_clock:\n";
test_empty_loop<std::chrono::system_clock>();
}
Above I exercise the test using our home-made x::clock
, and compare those results with using two of the system-supplied clocks: std::chrono::high_resolution_clock
and std::chrono::system_clock
. For me this prints out:
Using rdtsc:
1.72632 clock ticks per iteration
616ps per iteration
Using std::chrono::high_resolution_clock:
0.620105 clock ticks per iteration
620ps per iteration
Using std::chrono::system_clock:
0.00062457 clock ticks per iteration
624ps per iteration
This shows that each of these clocks has a different tick period, as the ticks per iteration is vastly different for each clock. However when converted to a known unit of time (e.g. picoseconds), I get approximately the same result for each clock (your mileage may vary).
Note how my code is completely free of "magic conversion constants". Indeed, there are only two magic numbers in the entire example:
x::clock
.You can use the following function with gcc running under x86 processors:
unsigned long long rdtsc()
{
#define rdtsc(low, high) \
__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
unsigned int low, high;
rdtsc(low, high);
return ((ulonglong)high << 32) | low;
}
with Digital Mars C++:
unsigned long long rdtsc()
{
_asm
{
rdtsc
}
}
which reads the high performance timer on the chip. I use this when doing profiling.
Here is a nice Boost timer that works well:
//Stopwatch.hpp
#ifndef STOPWATCH_HPP
#define STOPWATCH_HPP
//Boost
#include <boost/chrono.hpp>
//Std
#include <cstdint>
class Stopwatch
{
public:
Stopwatch();
virtual ~Stopwatch();
void Restart();
std::uint64_t Get_elapsed_ns();
std::uint64_t Get_elapsed_us();
std::uint64_t Get_elapsed_ms();
std::uint64_t Get_elapsed_s();
private:
boost::chrono::high_resolution_clock::time_point _start_time;
};
#endif // STOPWATCH_HPP
//Stopwatch.cpp
#include "Stopwatch.hpp"
Stopwatch::Stopwatch():
_start_time(boost::chrono::high_resolution_clock::now()) {}
Stopwatch::~Stopwatch() {}
void Stopwatch::Restart()
{
_start_time = boost::chrono::high_resolution_clock::now();
}
std::uint64_t Stopwatch::Get_elapsed_ns()
{
boost::chrono::nanoseconds nano_s = boost::chrono::duration_cast<boost::chrono::nanoseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(nano_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_us()
{
boost::chrono::microseconds micro_s = boost::chrono::duration_cast<boost::chrono::microseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(micro_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_ms()
{
boost::chrono::milliseconds milli_s = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(milli_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_s()
{
boost::chrono::seconds sec = boost::chrono::duration_cast<boost::chrono::seconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(sec.count());
}