cuda-streams | 易学教程

The behavior of stream 0 (default) and other streams

Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

    cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice); // stream 0
    cudaStream_t stream1;
    /* ...creating stream1... */
    somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure somekernel always launches AFTER cudaMemcpy finishes or will somekernel execute
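The scenario in this question can be reproduced with a small self-checking sketch. It assumes the legacy default stream (what nvcc uses unless `--default-stream per-thread` is passed), under which stream 0 and streams made with plain `cudaStreamCreate` are implicitly ordered with respect to each other. Streams created with `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)` do not get that implicit ordering. The kernel `scale_by_two` and the function name are hypothetical stand-ins, not code from the question.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for `somekernel`: doubles every element in place.
__global__ void scale_by_two(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copies n ones to the device through stream 0, doubles them in stream1,
// and returns true if every element comes back as 2.0f.
bool default_stream_then_stream1(int n) {
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t stream1;
    // Plain cudaStreamCreate gives a "blocking" stream, i.e. one that is
    // implicitly ordered against the legacy default stream.
    if (cudaStreamCreate(&stream1) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    // Stream 0: host-to-device copy.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Stream 1: kernel that consumes the copied data.
    scale_by_two<<<(n + 255) / 256, 256, 0, stream1>>>(dev, n);

    // Explicit wait so the read-back is correct regardless of which
    // default-stream mode the file was compiled with.
    cudaStreamSynchronize(stream1);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream1);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (float v : host) {
        if (v != 2.0f) return false;  // kernel saw stale data, or never ran
    }
    return true;
}
```

Calling `default_stream_then_stream1(1 << 16)` from `main` should return true if the kernel observed the copied data. Note that the ordering comes from the runtime's stream semantics, not from the compiler.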

Is it possible to manually set the SMs used for one CUDA stream?

Question: By default, a kernel will use all available SMs of the device (if it has enough blocks). However, I now have 2 streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs used by each stream (after setting the limit, a kernel in one stream would use at most that many SMs, e.g. 20 SMs for the compute-intensive stream and 4 SMs for the memory-intensive one). Is it possible to do so? If so, which API should I use?

Answer 1: In short, no there is no way to do what

Reading updated memory from other CUDA stream

Question: I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following.

    #include <iostream>
    #include <cuda.h>
    #include <cuda_runtime.h>

    #define FLAGCLEAR 0
    #define FLAGSET 1

    using namespace std;

    __global__ void set_flag(int *flag) {
        *flag = FLAGSET;
        // Wait for flag to reset.
        while (*flag == FLAGSET);
    }

    __global__ void read_flag(int *flag) {
        // wait for the flag to set.
        while (*flag != FLAGSET);
        // Clear it for next time.
        *flag = FLAGCLEAR;
    }

    int

CUDA streams not overlapping

Question: I have something very similar to the code:

    int k, no_streams = 4;
    cudaStream_t stream[no_streams];
    for (k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

    cudaMalloc(&g_in, size1*no_streams);
    cudaMalloc(&g_out, size2*no_streams);

    for (k = 0; k < no_streams; k++)
        cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

    for (k = 0; k < no_streams; k++)
        mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

    for (k = 0; k < no_streams; k++)
        cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float),
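One commonly cited reason for missing overlap in code of this shape is that `cudaMemcpyAsync` can only run truly asynchronously when the host buffers are page-locked. The sketch below is not the asker's program; it is a minimal, self-checking variant that assumes pinned buffers from `cudaMallocHost` and issues copy, kernel, and copy-back per stream. The kernel `add_one`, the function name, and the chunk sizes are hypothetical, and whether overlap actually shows up in a profiler still depends on the device.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for `mykernel`: out[i] = in[i] + 1.
__global__ void add_one(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Splits the work into one chunk per stream and returns true if all results
// are correct. Host buffers are pinned so the async copies are eligible to
// overlap with kernels running in other streams.
bool run_chunked_pipeline(int num_streams, int chunk_elems) {
    const int total = num_streams * chunk_elems;
    const size_t chunk_bytes = chunk_elems * sizeof(float);
    const size_t bytes = total * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;

    bool ok = cudaMallocHost(&h_in, bytes) == cudaSuccess &&
              cudaMallocHost(&h_out, bytes) == cudaSuccess &&
              cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess;

    std::vector<cudaStream_t> streams;
    for (int k = 0; ok && k < num_streams; ++k) {
        cudaStream_t s;
        ok = cudaStreamCreate(&s) == cudaSuccess;
        if (ok) streams.push_back(s);
    }

    if (ok) {
        for (int i = 0; i < total; ++i) h_in[i] = static_cast<float>(i);

        // Per stream: copy in, compute, copy out. Operations within one
        // stream run in order; different streams may overlap.
        for (int k = 0; k < num_streams; ++k) {
            const int off = k * chunk_elems;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk_bytes,
                            cudaMemcpyHostToDevice, streams[k]);
            add_one<<<(chunk_elems + 255) / 256, 256, 0, streams[k]>>>(
                d_in + off, d_out + off, chunk_elems);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk_bytes,
                            cudaMemcpyDeviceToHost, streams[k]);
        }
        ok = cudaDeviceSynchronize() == cudaSuccess;

        for (int i = 0; ok && i < total; ++i) {
            ok = (h_out[i] == static_cast<float>(i) + 1.0f);
        }
    }

    for (cudaStream_t s : streams) cudaStreamDestroy(s);
    if (d_in) cudaFree(d_in);
    if (d_out) cudaFree(d_out);
    if (h_in) cudaFreeHost(h_in);
    if (h_out) cudaFreeHost(h_out);
    return ok;
}
```

For example, `run_chunked_pipeline(4, 1 << 14)` mirrors the four streams in the question. The function only checks correctness; a timeline view in a profiler is still needed to confirm that the copies and kernels overlap.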

Get rid of busy waiting during asynchronous cuda stream executions

Question: I am looking for a way to get rid of busy waiting in the host thread in the following code (do not copy that code, it only shows an idea of my problem, it has many basic bugs):

    cudaStream_t steams[S_N];
    for (int i = 0; i < S_N; i++) {
        cudaStreamCreate(streams[i]);
    }
    int sid = 0;
    for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
        while (true) {
            if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
                cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
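One possible way to avoid polling with `cudaStreamQuery` is to wait on an event created with the `cudaEventBlockingSync` flag, so that `cudaEventSynchronize` blocks the host thread instead of spinning. The sketch below is an assumption-laden illustration, not the asker's code: the kernel `negate`, the function name, the two-stream round robin, and the per-stream pinned staging buffers are all hypothetical. The host waits only when it is about to overwrite a staging buffer that an earlier copy may still be reading.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-chunk work: negate every element in place.
__global__ void negate(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = -data[i];
}

// Processes total_elems values in steps of step_elems, round-robin over two
// streams, and returns true if every result equals the negated input.
bool process_without_polling(int total_elems, int step_elems) {
    const int kStreams = 2;  // hypothetical stream count
    const size_t step_bytes = step_elems * sizeof(float);
    std::vector<float> source(total_elems);
    for (int i = 0; i < total_elems; ++i) source[i] = static_cast<float>(i);

    cudaStream_t streams[kStreams];
    cudaEvent_t staged[kStreams];
    float *h_stage[kStreams], *d_buf[kStreams], *h_out = nullptr;
    cudaMallocHost(&h_out, total_elems * sizeof(float));
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        // Blocking-sync event: waiting on it puts the host thread to sleep.
        cudaEventCreateWithFlags(&staged[s],
                                 cudaEventBlockingSync | cudaEventDisableTiming);
        cudaMallocHost(&h_stage[s], step_bytes);
        cudaMalloc(&d_buf[s], step_bytes);
    }

    int sid = 0;
    for (int d = 0; d < total_elems; d += step_elems) {
        const int n = std::min(step_elems, total_elems - d);
        // Wait until the previous copy out of this staging buffer is done.
        // Returns immediately if the event has never been recorded.
        cudaEventSynchronize(staged[sid]);
        std::copy(source.begin() + d, source.begin() + d + n, h_stage[sid]);

        cudaMemcpyAsync(d_buf[sid], h_stage[sid], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[sid]);
        cudaEventRecord(staged[sid], streams[sid]);  // staging buffer reusable
        negate<<<(n + 255) / 256, 256, 0, streams[sid]>>>(d_buf[sid], n);
        cudaMemcpyAsync(h_out + d, d_buf[sid], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[sid]);
        sid = (sid + 1) % kStreams;
    }

    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (int i = 0; ok && i < total_elems; ++i) {
        ok = (h_out[i] == -static_cast<float>(i));
    }

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaFreeHost(h_stage[s]);
        cudaEventDestroy(staged[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFreeHost(h_out);
    return ok;
}
```

Reuse of `d_buf[sid]` needs no host-side wait because operations in the same stream already execute in order. How the final `cudaDeviceSynchronize` waits (spinning or yielding) depends on the device scheduling flags in effect.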

Stream scheduling order

Question: The way I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

    allOfData_A = data_A1 + data_A2
    allOfData_B = data_B1 + data_B2
    allOfData_C = data_C1 + data_C2

Data_C is the output of the kernel operation on both Data_A and Data_B (like C = A + B). The hardware supports one DeviceOverlap (concurrent) operation.

Process One:

    MemcpyAsync data_A1 stream1 H->D
    MemcpyAsync data_A2 stream2 H->D
    MemcpyAsync data_B1 stream1 H->D
    MemcpyAsync data_B2 stream2

How to reduce CUDA synchronize latency / delay

Question: This question is related to using CUDA streams to run many kernels. In CUDA there are many synchronization commands (cudaStreamSynchronize, cudaDeviceSynchronize, cudaThreadSynchronize) and also cudaStreamQuery to check whether streams are empty. I noticed when using the profiler that these synchronization commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart of course from using as few synchronization commands as possible. Also, are there any figures to judge the most efficient synchronization method? That is, consider 3 streams used in an
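In the absence of published figures, one rough way to compare synchronization calls is to time them directly on the target machine. The sketch below measures the average host-side cost of launching an empty kernel and then calling `cudaStreamSynchronize`. The function name and the empty kernel are hypothetical, and the number it returns includes launch overhead, so treat it as a relative indicator only. As a side note, `cudaThreadSynchronize` is deprecated in favor of `cudaDeviceSynchronize`.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Does nothing; used so the measurement is dominated by launch + sync cost.
__global__ void empty_kernel() {}

// Returns the average time in microseconds for one empty-kernel launch
// followed by cudaStreamSynchronize, or -1.0 if the stream cannot be created.
double average_sync_latency_us(int iterations) {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0;

    // Warm-up so one-time initialization cost is not included in the timing.
    empty_kernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        empty_kernel<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);
    }
    const auto stop = std::chrono::steady_clock::now();

    cudaStreamDestroy(stream);
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Swapping `cudaStreamSynchronize(stream)` for `cudaDeviceSynchronize()` or an event wait inside the timed loop gives comparable figures for the other methods on the same hardware and driver.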
