How to write a pointer-chasing benchmark using 64-bit pointers in CUDA?

与世无争的帅哥 提交于 2019-12-05 05:57:18

Introduction

Before I explain what you have to do to make the code working let me emphasize the following: You should have a very good understanding of the hardware you are testing and the design of your microbenchmark. Why is it important? The original code was designed for the GT200 which did not have a cache for ordinary global memory loads. If you now just fix the pointer problem you will measure basically the L2 latency (on Kepler, where by default L1 is not used) because the original code uses a very small memory which fits nicely into the cache.

Disclaimer: For me it is also the first time to study such a benchmarking code. Therefore, check carefully before you use the code below. I do not guarantee that I did not make mistakes, when transforming the original code.

The simple solution (measures basically the cache latency)

First, you did not include all relevant parts of the code in your question. The most important part is

for (i = 0; i < N; i += step) {
    // Device pointers are 32-bit on GT200.
    h_a[i] = ((unsigned int)(uintptr_t)d_a) + ((i + stride) % N)*sizeof(unsigned int);
}

where an array of memory locations is built where each entry points to the entry for the next pointer. Now all you need to do is replace all unsigned int (which is used for storing the 32-bit pointers) by unsigned long long int, both in the setup code and in the kernel.

I won't post the code since I cannot recommend running such code if you don't understand it, see Introduction. If you understand it, then it is simple.

My solution

Basically what I did is to use as much memory as needed to evaluate all pointers or a maximal amount of memory of 1GB. In both cases I wrapped the last entry to the first entry. Note that depending on the stride, a lot of array entries may be uninitialized (because they are never used).

The following code is basically the original code after a bit of clean-up (but it's still not very clean, sorry...) and the change in the memory. I introduced a typedef

typedef unsigned long long int ptrsize_type;

to highlight at which locations the unsigned int from the original code has to be replaced with unsigned long long int. I used the repeat1024 macro (from the original code) which just copies the line j=*(ptrsize_type **)j; 1024 times.

The strides can be adjusted in measure_global_latency(). In the output the stride is given in bytes.

I leave the interpretation of the latency for the different strides to you. The strides need to be adjusted such that you do not reuse the cache!

#include <stdio.h> 
#include <stdint.h>

#include "repeat.h"

typedef unsigned long long int ptrsize_type;

__global__ void global_latency (ptrsize_type** my_array, int array_length, int iterations, unsigned long long * duration) {

    unsigned long long int start_time, end_time;
    ptrsize_type *j = (ptrsize_type*)my_array;
    volatile unsigned long long int sum_time;

    sum_time = 0;

    for (int k = 0; k < iterations; k++)
    {

        start_time = clock64();
        repeat1024(j=*(ptrsize_type **)j;)
        end_time = clock64();

        sum_time += (end_time - start_time);
    }

    ((ptrsize_type*)my_array)[array_length] = (ptrsize_type)j;
    ((ptrsize_type*)my_array)[array_length+1] = (ptrsize_type) sum_time;
    duration[0] = sum_time;
}

void parametric_measure_global(int N, int iterations, unsigned long long int maxMem, int stride)
{
    unsigned long long int maxMemToArraySize = maxMem / sizeof( ptrsize_type );
    unsigned long long int maxArraySizeNeeded = 1024*iterations*stride;
    unsigned long long int maxArraySize = (maxMemToArraySize<maxArraySizeNeeded)?(maxMemToArraySize):(maxArraySizeNeeded);

    ptrsize_type* h_a = new ptrsize_type[maxArraySize+2];
    ptrsize_type** d_a;
    cudaMalloc ((void **) &d_a, (maxArraySize+2)*sizeof(ptrsize_type));

    unsigned long long int* duration;
    cudaMalloc ((void **) &duration, sizeof(unsigned long long int));

    for ( int i = 0; true; i += stride)
    {
        ptrsize_type nextAddr = ((ptrsize_type)d_a)+(i+stride)*sizeof(ptrsize_type);
        if( i+stride < maxArraySize )
        {
            h_a[i] = nextAddr;
        }
        else
        {
            h_a[i] = (ptrsize_type)d_a; // point back to the first entry
            break;
        }
    }
    cudaMemcpy((void *)d_a, h_a, (maxArraySize+2)*sizeof(ptrsize_type), cudaMemcpyHostToDevice);

    unsigned long long int latency_sum = 0;
    int repeat = 1;
    for (int l=0; l <repeat; l++)
    {
        global_latency<<<1,1>>>(d_a, maxArraySize, iterations, duration);
        cudaThreadSynchronize ();

        cudaError_t error_id = cudaGetLastError();
        if (error_id != cudaSuccess)
        {
            printf("Error is %s\n", cudaGetErrorString(error_id));
        }

        unsigned long long int latency;
        cudaMemcpy( &latency, duration, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
        latency_sum += latency;
    }

    cudaFree(d_a);
    cudaFree(duration);

    delete[] h_a;
    printf("%f\n", (double)(latency_sum/(repeat*1024.0*iterations)) );
}

void measure_global_latency()
{
    int maxMem = 1024*1024*1024; // 1GB
    int N = 1024;
    int iterations = 1;

    for (int stride = 1; stride <= 1024; stride+=1)
    {
        printf ("  %5d, ", stride*sizeof( ptrsize_type ));
        parametric_measure_global( N, iterations, maxMem, stride );
    }
    for (int stride = 1024; stride <= 1024*1024; stride+=1024)
    {
        printf ("  %5d, ", stride*sizeof( ptrsize_type ));
        parametric_measure_global( N, iterations, maxMem, stride );
    }
}

int main()
{
    measure_global_latency();
    return 0;
}

Edit:

Some more details to the comments: I did not include the interpretation of the result because I do not consider myself an expert on such benchmarks. It was not my intend to make the interpretation an exercise to the reader.

Now here is my interpretation: I get the same results for Kepler GPUs (with L1 not available/disabled). Something below 200 cycles for a L2 read is what you get with a small stride. The accuracy can be improved by increasing the iterations variable to definitely reuse L2.

The tricky task is now to find a stride that does not reuse the L2 cache. In my approach I just blindly try many different (large) strides and hope that L2 is not reused. There, I also get something around ~500 cycles. Of course, the better approach would be to think more about the structure of the cache and deduce the correct stride by reasoning and not by trial and error. That's the main reason why I don't wanted to interpret the result myself.

Why is the latency decreasing again for strides > 1MB? The reason for this behaviour is that I used a fixed size of 1GB for the maximal memory usage. With the 1024 pointer lookups (repeat1024), a stride of 1MB just fits in the memory. Larger strides will wrap-around and use again data from the L2 cache. The main problem with the current code is that the 1024 pointer (1024*64 bit) still fit perfectly in the L2 cache. This introduces another trap: If you set the number of iterations to something > 1 and exceed the memory limit with 1024*iterations*stride*sizeof(ptrsize_type) you will again use the L2 cache.

Possible solution:

  • Instead of wrapping the last entry to the first element, one should implement a smarter wrapping to an (unused!) location which is between the size of the cache-line and the stride. But you need to be very careful that you do not overwrite memory locations, especially if you are wrapping around multiple times.
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!