Question
I followed the method described in "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" and tested it in VS2013 with CUDA 7.5. The multi-stream example worked, but the multi-threading example did not give the expected result. The code is below:
#include <pthread.h>
#include <cstdio>
#include <cmath>
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include "cuda.h"

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));
    kernel<<<1, 64>>>(data, N);
    cudaStreamSynchronize(0);
    return NULL;
}

int main()
{
    const int num_threads = 8;
    pthread_t threads[num_threads];
    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }
    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }
    cudaDeviceReset();
    return 0;
}
I also tried adding the macro CUDA_API_PER_THREAD_DEFAULT_STREAM under CUDA C/C++ > Host > Preprocessor Definitions, but the result was the same. The timeline generated by the profiler is below:
[Image: profiler timeline]
Do you have any idea what is happening here? Many thanks in advance.
Answer 1:
The code you posted works for me as expected when compiled and run on a Linux system with CUDA 7.0 like so:
$ nvcc -arch=sm_30 --default-stream per-thread -o thread.out thread.cu
From that I can only assume that either you have a platform-specific issue, or your build method is incorrect (note that --default-stream per-thread must be specified for every translation unit in the build).
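One way to take the build flag out of the picture while debugging is to launch into the `cudaStreamPerThread` handle explicitly, which the CUDA runtime (7.0 and later) documents as an implicit stream with per-thread synchronization behavior. The following is a minimal sketch under stated assumptions, not the code from the question or this answer: it swaps pthreads for C++11 `std::thread` (so it may need `-std=c++11` with older nvcc versions on Linux), and the names `check`, `worker`, and `run_workers` are hypothetical helpers introduced for illustration.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as in the question.
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

// Hypothetical helper: logs a failed CUDA call and reports whether it succeeded.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical worker: each host thread launches into its own per-thread stream.
static void worker(int n, int *ok)
{
    float *data = NULL;
    *ok = 0;
    if (!check(cudaMalloc(&data, n * sizeof(float)), "cudaMalloc")) return;

    // The fourth launch parameter selects the stream explicitly,
    // so this does not rely on --default-stream per-thread.
    kernel<<<1, 64, 0, cudaStreamPerThread>>>(data, n);
    bool good = check(cudaGetLastError(), "kernel launch");

    // Wait for this thread's work before releasing the buffer.
    good = check(cudaStreamSynchronize(cudaStreamPerThread), "cudaStreamSynchronize") && good;
    good = check(cudaFree(data), "cudaFree") && good;
    *ok = good ? 1 : 0;
}

// Hypothetical driver: returns how many threads finished without a CUDA error.
int run_workers(int num_threads, int n)
{
    std::vector<int> ok(num_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++) {
        threads.push_back(std::thread(worker, n, &ok[i]));
    }
    for (auto &t : threads) {
        t.join();
    }
    int succeeded = 0;
    for (int v : ok) {
        succeeded += v;
    }
    return succeeded;
}
```

Calling `run_workers(8, 1 << 20)` from `main` mirrors the eight-thread setup in the question. If the kernels overlap in the profiler with this version but not with the original, that would point at the build configuration rather than the platform. Whether they actually overlap still depends on the device and its available resources.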
Answer 2:
Update: concurrency may occur once I add a call to cudaFree, as shown below. Is this because of a lack of synchronization?
void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));
    kernel<<<1, 64>>>(data, N);
    cudaFree(data); // Concurrency may happen when I add this line
    cudaStreamSynchronize(0);
    return NULL;
}
compiled like this:
nvcc -arch=sm_30 --default-stream per-thread -lpthreadVC2 kernel.cu -o kernel.exe
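Because allocation and deallocation calls can themselves influence how launches are ordered, a rough cross-check that does not depend on the profiler is to allocate every buffer up front and compare wall-clock time for one host thread against several. The sketch below is an illustration under assumptions rather than part of the original post: it uses `std::thread` and `std::chrono` instead of pthreads, launches into `cudaStreamPerThread`, and the names `spin`, `launch_and_wait`, and `time_threads_ms` are hypothetical.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Same body as the kernel in the question, renamed for this sketch.
__global__ void spin(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

// Hypothetical worker: launch into this thread's stream and wait for it.
static void launch_and_wait(float *data, int n)
{
    spin<<<1, 64, 0, cudaStreamPerThread>>>(data, n);
    cudaStreamSynchronize(cudaStreamPerThread);
}

// Hypothetical timer: returns elapsed milliseconds, or -1.0 on allocation failure.
double time_threads_ms(int num_threads, int n)
{
    // Allocate outside the timed region so cudaMalloc/cudaFree do not
    // affect the measurement.
    std::vector<float *> bufs(num_threads, (float *)NULL);
    for (int i = 0; i < num_threads; i++) {
        if (cudaMalloc(&bufs[i], n * sizeof(float)) != cudaSuccess) {
            for (int j = 0; j < i; j++) cudaFree(bufs[j]);
            return -1.0;
        }
    }
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++) {
        threads.push_back(std::thread(launch_and_wait, bufs[i], n));
    }
    for (auto &t : threads) {
        t.join();
    }
    auto stop = std::chrono::steady_clock::now();

    for (int i = 0; i < num_threads; i++) {
        cudaFree(bufs[i]);
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

If the kernels overlap, `time_threads_ms(8, 1 << 20)` should come out well below eight times `time_threads_ms(1, 1 << 20)`. The actual numbers depend on the GPU, and the first call also pays one-time context initialization, so it is worth discarding a warm-up run.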
Source: https://stackoverflow.com/questions/34259948/how-to-enable-cuda-7-0-per-thread-default-stream-in-visual-studio-2013