问题
I have two tasks. Each of them perform copy to device (D), run kernel (R), and copy to host (H) operations. I am overlapping copy to device of task2 (D2) with run kernel of task1 (R1). In addition, I am overlapping run kernel of task2 (R2) with copy to host of task1 (H1).
I also record start and stop time of D, R, H ops of each task using cudaEventRecord.
I have GeForce GT 555M, CUDA 4.1, and Fedora 16.
I have three scenarios:
Scenario1: I use one stream for each task. I place start/stop events right before/after the ops.
Scenario2: I use one stream for each task. I place the start event of the second of the overlapping ops before the start of first one (i.e. place start R1 before start D2, and place start H1 before start R2).
Scenario3: I use two streams for each task. I use cudaStreamWaitEvents to synchronize between these two streams. One stream is used for D and H (copy) ops, the other one is used for R op. I place start/stop events right before/after the ops.
Scenario1 fails to overlap ops (neither D2-R1 nor R2-H1 can be overlapped), whereas Scenario2 and Scenario3 succeed. And my question is: Why Scenerio1 fails while the other ones succeed?
For each scenario I measure the overall time for performing Task1 and Task2. Running both R1 and R2 takes 5 ms each. Since Scenario1 fails to overlap ops, the overall time is 10ms more than Scenario 2 and 3.
Here are the pseudo-code for scenarios:
Scenario1 (FAILS): use stream1 for task1, use stream2 for task2
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start D2 on stream2
D2 on stream2
stop D2 on stream2
start R1 on stream1
R1 on stream1
stop R1 on stream1
start R2 on stream2
R2 on stream2
stop R2 on stream2
start H1 on stream1
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
Scenario2 (SUCCEEDS): use stream1 for task1, use stream2 for task2, move-up the start event of the second of the overlaping ops.
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start R1 on stream1 //moved-up
start D2 on stream2
D2 on stream2
stop D2 on stream2
R1 on stream1
stop R1 on stream1
start H1 on stream1 //moved-up
start R2 on stream2
R2 on stream2
stop R2 on stream2
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
Scenario3 (SUCCEEDS): use stream1 and 3 for task1, use stream2 and 4 for task2
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start D2 on stream2
D2 on stream2
stop D2 on stream2
start R1 on stream3
R1 on stream3
stop R1 on stream3
start R2 on stream4
R2 on stream4
stop R2 on stream4
start H1 on stream1
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
Here are the overall timing info for all Scenarios: Scenario1 = 39.390240 Scenario2 = 29.190241 Scenario3 = 29.298208
I also attach the CUDA code below:
#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
{
C[i] = A[i] + B[N-i];
C[i] = A[i] + B[i] * 2;
C[i] = A[i] + B[i] * 3;
C[i] = A[i] + B[i] * 4;
C[i] = A[i] + B[i];
}
}
void overlap()
{
float* h_A;
float *d_A, *d_C;
float* h_A2;
float *d_A2, *d_C2;
int N = 10000000;
size_t size = N * sizeof(float);
cudaMallocHost((void**) &h_A, size);
cudaMallocHost((void**) &h_A2, size);
// Allocate vector in device memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_C, size);
cudaMalloc((void**)&d_A2, size);
cudaMalloc((void**)&d_C2, size);
float fTimCpyDev1, fTimKer1, fTimCpyHst1, fTimCpyDev2, fTimKer2, fTimCpyHst2;
float fTimOverall3, fTimOverall1, fTimOverall2;
for (int i = 0; i<N; ++i)
{
h_A[i] = 1;
h_A2[i] = 5;
}
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
cudaStream_t csStream1, csStream2, csStream3, csStream4;
cudaStreamCreate(&csStream1);
cudaStreamCreate(&csStream2);
cudaStreamCreate(&csStream3);
cudaStreamCreate(&csStream4);
cudaEvent_t ceEvStart, ceEvStop;
cudaEventCreate( &ceEvStart );
cudaEventCreate( &ceEvStop );
cudaEvent_t ceEvStartCpyDev1, ceEvStopCpyDev1, ceEvStartKer1, ceEvStopKer1, ceEvStartCpyHst1, ceEvStopCpyHst1;
cudaEventCreate( &ceEvStartCpyDev1 );
cudaEventCreate( &ceEvStopCpyDev1 );
cudaEventCreate( &ceEvStartKer1 );
cudaEventCreate( &ceEvStopKer1 );
cudaEventCreate( &ceEvStartCpyHst1 );
cudaEventCreate( &ceEvStopCpyHst1 );
cudaEvent_t ceEvStartCpyDev2, ceEvStopCpyDev2, ceEvStartKer2, ceEvStopKer2, ceEvStartCpyHst2, ceEvStopCpyHst2;
cudaEventCreate( &ceEvStartCpyDev2 );
cudaEventCreate( &ceEvStopCpyDev2 );
cudaEventCreate( &ceEvStartKer2 );
cudaEventCreate( &ceEvStopKer2 );
cudaEventCreate( &ceEvStartCpyHst2 );
cudaEventCreate( &ceEvStopCpyHst2 );
//Scenario1
cudaDeviceSynchronize();
cudaEventRecord(ceEvStart, 0);
cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);
cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);
cudaEventRecord(ceEvStartKer1, csStream1);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1);
cudaEventRecord(ceEvStartKer2, csStream2);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);
cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);
cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);
cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();
cudaEventElapsedTime( &fTimOverall1, ceEvStart, ceEvStop);
printf("Scenario1 overall time= %10f\n", fTimOverall1);
//Scenario2
cudaDeviceSynchronize();
cudaEventRecord(ceEvStart, 0);
cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);
cudaEventRecord(ceEvStartKer1, csStream1); //moved up
cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1);
cudaEventRecord(ceEvStartCpyHst1, csStream1); //moved up
cudaEventRecord(ceEvStartKer2, csStream2);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);
cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);
cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();
cudaEventElapsedTime( &fTimOverall2, ceEvStart, ceEvStop);
printf("Scenario2 overall time= %10f\n", fTimOverall2);
//Scenario3
cudaDeviceSynchronize();
cudaEventRecord(ceEvStart, 0);
cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);
cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);
cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
cudaEventRecord(ceEvStartKer1, csStream3);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream3>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream3);
cudaStreamWaitEvent(csStream4, ceEvStopCpyDev2, 0);
cudaEventRecord(ceEvStartKer2, csStream4);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream4>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream4);
cudaStreamWaitEvent(csStream1, ceEvStopKer1, 0);
cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);
cudaStreamWaitEvent(csStream2, ceEvStopKer2, 0);
cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);
cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();
cudaEventElapsedTime( &fTimOverall3, ceEvStart, ceEvStop);
printf("Scenario3 overall time = %10f\n", fTimOverall3);
cudaStreamDestroy(csStream1);
cudaStreamDestroy(csStream2);
cudaStreamDestroy(csStream3);
cudaStreamDestroy(csStream4);
cudaFree(d_A);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFree(d_A2);
cudaFree(d_C2);
cudaFreeHost(h_A2);
}
int main()
{
overlap();
}
Thank you very much for your time in advance!
回答1:
(Note, I'm more familiar with the Tesla series devices, and don't actually have a GT 555M to experiment with, so my results refer specifically to a C2070. I don't know how many copy engines the 555m has, but I expect the issues described below are what's causing the behavior you are seeing.)
The issue is the lesser-known fact that the cudaEventRecords are CUDA operations too, and they also must be placed in one of the hardware queues before getting launched/executed. (A complicating factor is that, since cudaEventRecord is neither a copy operation, nor a compute kernel, it can actually go in any hardware queue. My understanding is that they usually go in the same hardware queue as the preceding CUDA operation of the same stream, but as this is not specified in the docs the actual operation may be device/driver dependent.)
If I can extend your notation to use 'E' for 'Event record', and detail how the hardware queues are filled (similar to what is done in the "CUDA C/C++ Streams and Concurrency" webinar) then, in your Scenario 1 example, you have:
Issue order for CUDA operations:
ED1
D1
ED1
ED2
D2
ED2
ER1
R1
ER1
...
These fill the queues like:
Hardware Queues: copyH2D Kernel
------- ------
ED1 * R1
D1 / ER1
ED1 / ...
ED2 /
D2 /
ED2 /
ER1 *
and you can see that R1, by virtue of being in stream 1, will not execute until ER1 has completed, which won't happen until both D1 and D2 have completed since they are all serialized in the H2D copy queue.
By moving the cudaEventRecord, ER1, up in Scenario 2, you avoid this since all CUDA operations in stream 1, prior to R1, complete before D2. This permits R1 to launch concurrently to D2.
Hardware Queues: copyH2D Kernel
------- ------
ED1 * R1
D1 / ER1
ED1 / ...
ER1 *
ED2
D2
ED2
In your Scenario 3, the ER1 is replaced with an ER3. As this is the first operation in stream 3, it can go anywhere, and (guessing) goes in either the Kernel or copy D2H queue from which it could get launched immediately, (if you didn't have the
cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
for synchronization with stream 1) so it does not cause false serialization with D2.
Hardware Queues: copyH2D Kernel
------- ------
ED1 * ER3
D1 / R3
ED1 * ER3
ED2 ...
D2
ED2
My comments would be
- Issue order for CUDA operations is very important when considering concurrency
- cudaEventRecord, and similar operations, get placed on hardware queues like everything else and can cause false serialization. Exactly how they get placed in hardware queues is not well described, and could be device/driver dependent. So for optimal concurrency, the use of cudaEventRecord and similar operations should be reduced to the minimum necessary.
- If kernels need to be timed for performance studies, that can be done using events but it will break concurrency. This is fine for development but should be avoided for production code.
However you should note that the upcoming Kepler GK110 (Tesla K20) devices make significant improvements in reducing false serialization by using 32 hardware queues. See the GK110 Whitepaper for details (page 17).
Hope this helps.
来源:https://stackoverflow.com/questions/11996748/location-of-cudaeventrecord-and-overlapping-ops-from-different-streams