How to enable CUDA 7.0+ per-thread default stream in Visual Studio 2013?

Question

I followed the method described in "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" and tested it in VS2013 with CUDA 7.5. The multi-stream example worked, but the multi-threading example did not give the expected result. The code is below:

#include <pthread.h>
#include <cstdio>
#include <cmath>

#define CUDA_API_PER_THREAD_DEFAULT_STREAM

#include "cuda.h"

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N);

    cudaStreamSynchronize(0);

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}

I also tried adding the macro CUDA_API_PER_THREAD_DEFAULT_STREAM to CUDA C/C++ -> Host -> Preprocessor Definitions, but the result was the same. The original post showed the timeline generated by the Profiler here (image not included).

Do you have any idea what happened here? Many thanks in advance.

Answer 1:

The code you have posted works for me as you would expect when compiled and run on a Linux system with CUDA 7.0 like so:

$ nvcc -arch=sm_30  --default-stream per-thread -o thread.out thread.cu

From that I can only assume that either you have a platform-specific issue or your build method is incorrect (note that --default-stream per-thread must be specified for every translation unit in the build).
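
One way to take the build flags out of the picture is to name the per-thread stream explicitly. The sketch below assumes CUDA 7.0 or later, where the runtime exposes the `cudaStreamPerThread` handle. It passes that handle to both the kernel launch and `cudaStreamSynchronize`, so each host thread should get its own stream even if `--default-stream per-thread` never reached the compiler. The switch from pthreads to `std::thread` and the names `worker` and `launch_on_per_thread_streams` are illustrative choices, not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <thread>
#include <vector>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

// Hypothetical worker run by each host thread.
// Stores the first CUDA error it encounters in *status.
void worker(cudaError_t *status)
{
    float *data = nullptr;
    *status = cudaMalloc(&data, N * sizeof(float));
    if (*status != cudaSuccess) return;

    // Launch into this host thread's own stream, named explicitly.
    kernel<<<1, 64, 0, cudaStreamPerThread>>>(data, N);
    *status = cudaGetLastError();
    if (*status == cudaSuccess) {
        // Wait only for this thread's stream, not for the whole device.
        *status = cudaStreamSynchronize(cudaStreamPerThread);
    }
    cudaFree(data);
}

// Hypothetical driver: starts num_threads host threads and
// returns how many of them reported a CUDA error.
int launch_on_per_thread_streams(int num_threads)
{
    std::vector<cudaError_t> status(num_threads, cudaSuccess);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++) {
        threads.emplace_back(worker, &status[i]);
    }
    for (auto &t : threads) {
        t.join();
    }
    int failures = 0;
    for (cudaError_t s : status) {
        if (s != cudaSuccess) failures++;
    }
    return failures;
}
```

Calling `launch_on_per_thread_streams(8)` from `main()` mirrors the eight threads in the question. With older toolkits on Linux, `-std=c++11` may be needed for `std::thread`. If the profiler shows overlap with this version but not with the original, the build configuration is the more likely culprit.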

Answer 2:

Update: the kernels may run concurrently once I add a "cudaFree" call, as shown below. Is that because of the missing synchronization?

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N);
    cudaFree(data); // Concurrency may happen when I add this line
    cudaStreamSynchronize(0);

    return NULL;
}

compiled with:

nvcc -arch=sm_30  --default-stream per-thread -lpthreadVC2 kernel.cu -o kernel.exe
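
Whether the missing synchronization explains the profiler result is not settled in this thread. In practice `cudaFree` tends to wait for outstanding device work, so placing it right after the launch can itself act as a synchronization point. A minimal sketch of the same worker, with every runtime call checked and the buffer freed only after the stream has been synchronized, might look like this. The `CHECK` macro and the name `launch_kernel_checked` are hypothetical helpers introduced for illustration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

const int N = 1 << 20;

// Hypothetical helper: print the failing call and make the caller return false.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            return false;                                        \
        }                                                        \
    } while (0)

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

// Same work as launch_kernel in the question, but with checked calls.
// Note: on an early return after cudaMalloc the buffer is not freed;
// that is acceptable for a diagnostic sketch only.
bool launch_kernel_checked()
{
    float *data = NULL;
    CHECK(cudaMalloc(&data, N * sizeof(float)));

    kernel<<<1, 64>>>(data, N);
    CHECK(cudaGetLastError());        // catches launch configuration errors

    // Stream 0 is the per-thread default stream only when the file is
    // built with --default-stream per-thread.
    CHECK(cudaStreamSynchronize(0));  // wait for the kernel first
    CHECK(cudaFree(data));            // then release the buffer
    return true;
}
```

Calling `launch_kernel_checked()` from each host thread in place of the original worker body makes silent failures visible, which helps separate a real scheduling difference from a call that quietly returned an error.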

Source: https://stackoverflow.com/questions/34259948/how-to-enable-cuda-7-0-per-thread-default-stream-in-visual-studio-2013

Tags: c++, multithreading, visual-studio-2013, cuda