This research paper runs a series of several CUDA microbenchmarks on a GPU to obtain statistics like global memory latency, instruction throughput, etc. This link is the link to the set of microbenchmarks that the authors wrote and ran on their GPU.
One of the microbenchmarks called global.cu
gives the code for a pointer-chasing benchmark to measure global memory latency.
This is the code of the kernel that is run.
__global__ void global_latency (unsigned int ** my_array, int array_length, int iterations, int ignore_iterations, unsigned long long * duration) {
unsigned int start_time, end_time;
unsigned int *j = (unsigned int*)my_array;
volatile unsigned long long sum_time;
sum_time = 0;
duration[0] = 0;
for (int k = -ignore_iterations; k < iterations; k++) {
if (k==0) {
sum_time = 0; // ignore some iterations: cold icache misses
}
start_time = clock();
repeat256(j=*(unsigned int **)j;) // unroll macro, simply creates an unrolled loop of 256 instructions, nothing more
end_time = clock();
sum_time += (end_time - start_time);
}
((unsigned int*)my_array)[array_length] = (unsigned int)j;
((unsigned int*)my_array)[array_length+1] = (unsigned int) sum_time;
duration[0] = sum_time;
}
The line of code performing the pointer chasing in the case of 32-bit pointers is:
j = *(unsigned int**)j;
This is the key line, because the remaining lines of code are only used for time measurement.
I tried to run this on my GPU, but I faced an issue. Running the same microbenchmark with no changes gives me a runtime error of An illegal memory access was encountered
.
In the same link they explain that:
The global memory tests use pointer chasing code where the pointer values are stored in an array. Pointers on GT200 are 32 bits. The global memory test will need to be changed if the pointer size changes, e.g., 64-bit pointers on Fermi.
It turns out that my GPU is of Kepler architecture, which has 64-bit pointers.
How do I modify that bit of pointer-chasing code which originally deals with 32-bit pointers, in order to measure global memory latency using 64-bit pointers?
Edit:
From havogt's answer: An important piece of information that I should have included in the question is this portion of the code, where an array of memory locations is built where each entry points to the entry for the next pointer.
for (i = 0; i < N; i += step) {
// Device pointers are 32-bit on GT200.
h_a[i] = ((unsigned int)(uintptr_t)d_a) + ((i + stride) % N)*sizeof(unsigned int);
}
Introduction
Before I explain what you have to do to make the code working let me emphasize the following: You should have a very good understanding of the hardware you are testing and the design of your microbenchmark. Why is it important? The original code was designed for the GT200 which did not have a cache for ordinary global memory loads. If you now just fix the pointer problem you will measure basically the L2 latency (on Kepler, where by default L1 is not used) because the original code uses a very small memory which fits nicely into the cache.
Disclaimer: For me it is also the first time to study such a benchmarking code. Therefore, check carefully before you use the code below. I do not guarantee that I did not make mistakes, when transforming the original code.
The simple solution (measures basically the cache latency)
First, you did not include all relevant parts of the code in your question. The most important part is
for (i = 0; i < N; i += step) {
// Device pointers are 32-bit on GT200.
h_a[i] = ((unsigned int)(uintptr_t)d_a) + ((i + stride) % N)*sizeof(unsigned int);
}
where an array of memory locations is built where each entry points to the entry for the next pointer.
Now all you need to do is replace all unsigned int
(which is used for storing the 32-bit pointers) by unsigned long long int
, both in the setup code and in the kernel.
I won't post the code since I cannot recommend running such code if you don't understand it, see Introduction. If you understand it, then it is simple.
My solution
Basically what I did is to use as much memory as needed to evaluate all pointers or a maximal amount of memory of 1GB. In both cases I wrapped the last entry to the first entry. Note that depending on the stride, a lot of array entries may be uninitialized (because they are never used).
The following code is basically the original code after a bit of clean-up (but it's still not very clean, sorry...) and the change in the memory. I introduced a typedef
typedef unsigned long long int ptrsize_type;
to highlight at which locations the unsigned int
from the original code has to be replaced with unsigned long long int
. I used the repeat1024
macro (from the original code) which just copies the line j=*(ptrsize_type **)j;
1024 times.
The strides can be adjusted in measure_global_latency()
. In the output the stride is given in bytes.
I leave the interpretation of the latency for the different strides to you. The strides need to be adjusted such that you do not reuse the cache!
#include <stdio.h>
#include <stdint.h>
#include "repeat.h"
typedef unsigned long long int ptrsize_type;
__global__ void global_latency (ptrsize_type** my_array, int array_length, int iterations, unsigned long long * duration) {
unsigned long long int start_time, end_time;
ptrsize_type *j = (ptrsize_type*)my_array;
volatile unsigned long long int sum_time;
sum_time = 0;
for (int k = 0; k < iterations; k++)
{
start_time = clock64();
repeat1024(j=*(ptrsize_type **)j;)
end_time = clock64();
sum_time += (end_time - start_time);
}
((ptrsize_type*)my_array)[array_length] = (ptrsize_type)j;
((ptrsize_type*)my_array)[array_length+1] = (ptrsize_type) sum_time;
duration[0] = sum_time;
}
void parametric_measure_global(int N, int iterations, unsigned long long int maxMem, int stride)
{
unsigned long long int maxMemToArraySize = maxMem / sizeof( ptrsize_type );
unsigned long long int maxArraySizeNeeded = 1024*iterations*stride;
unsigned long long int maxArraySize = (maxMemToArraySize<maxArraySizeNeeded)?(maxMemToArraySize):(maxArraySizeNeeded);
ptrsize_type* h_a = new ptrsize_type[maxArraySize+2];
ptrsize_type** d_a;
cudaMalloc ((void **) &d_a, (maxArraySize+2)*sizeof(ptrsize_type));
unsigned long long int* duration;
cudaMalloc ((void **) &duration, sizeof(unsigned long long int));
for ( int i = 0; true; i += stride)
{
ptrsize_type nextAddr = ((ptrsize_type)d_a)+(i+stride)*sizeof(ptrsize_type);
if( i+stride < maxArraySize )
{
h_a[i] = nextAddr;
}
else
{
h_a[i] = (ptrsize_type)d_a; // point back to the first entry
break;
}
}
cudaMemcpy((void *)d_a, h_a, (maxArraySize+2)*sizeof(ptrsize_type), cudaMemcpyHostToDevice);
unsigned long long int latency_sum = 0;
int repeat = 1;
for (int l=0; l <repeat; l++)
{
global_latency<<<1,1>>>(d_a, maxArraySize, iterations, duration);
cudaThreadSynchronize ();
cudaError_t error_id = cudaGetLastError();
if (error_id != cudaSuccess)
{
printf("Error is %s\n", cudaGetErrorString(error_id));
}
unsigned long long int latency;
cudaMemcpy( &latency, duration, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
latency_sum += latency;
}
cudaFree(d_a);
cudaFree(duration);
delete[] h_a;
printf("%f\n", (double)(latency_sum/(repeat*1024.0*iterations)) );
}
void measure_global_latency()
{
int maxMem = 1024*1024*1024; // 1GB
int N = 1024;
int iterations = 1;
for (int stride = 1; stride <= 1024; stride+=1)
{
printf (" %5d, ", stride*sizeof( ptrsize_type ));
parametric_measure_global( N, iterations, maxMem, stride );
}
for (int stride = 1024; stride <= 1024*1024; stride+=1024)
{
printf (" %5d, ", stride*sizeof( ptrsize_type ));
parametric_measure_global( N, iterations, maxMem, stride );
}
}
int main()
{
measure_global_latency();
return 0;
}
Edit:
Some more details to the comments: I did not include the interpretation of the result because I do not consider myself an expert on such benchmarks. It was not my intend to make the interpretation an exercise to the reader.
Now here is my interpretation: I get the same results for Kepler GPUs (with L1 not available/disabled). Something below 200 cycles for a L2 read is what you get with a small stride. The accuracy can be improved by increasing the iterations
variable to definitely reuse L2.
The tricky task is now to find a stride that does not reuse the L2 cache. In my approach I just blindly try many different (large) strides and hope that L2 is not reused. There, I also get something around ~500 cycles. Of course, the better approach would be to think more about the structure of the cache and deduce the correct stride by reasoning and not by trial and error. That's the main reason why I don't wanted to interpret the result myself.
Why is the latency decreasing again for strides > 1MB? The reason for this behaviour is that I used a fixed size of 1GB for the maximal memory usage. With the 1024 pointer lookups (repeat1024
), a stride of 1MB just fits in the memory. Larger strides will wrap-around and use again data from the L2 cache. The main problem with the current code is that the 1024 pointer (1024*64 bit) still fit perfectly in the L2 cache.
This introduces another trap: If you set the number of iterations
to something > 1 and exceed the memory limit with 1024*iterations*stride*sizeof(ptrsize_type)
you will again use the L2 cache.
Possible solution:
- Instead of wrapping the last entry to the first element, one should implement a smarter wrapping to an (unused!) location which is between the size of the cache-line and the stride. But you need to be very careful that you do not overwrite memory locations, especially if you are wrapping around multiple times.
来源:https://stackoverflow.com/questions/36416843/how-to-write-a-pointer-chasing-benchmark-using-64-bit-pointers-in-cuda