Copying from GPU to CPU is slower than copying from CPU to GPU

生来不讨喜 2021-01-18 16:18

I have been learning CUDA for a while and have run into the following problem.

Here is what I am doing:

Copy GPU

int* B;
// ...         


        
2 Answers
  •  悲哀的现实
    2021-01-18 16:45

    Instead of using clock() to measure time, you should use events:

    With events you would have something like this:

      cudaEvent_t start, stop;   // the two events that bracket the timed region
      float time;                // elapsed time, in milliseconds
      cudaEventCreate(&start);   // create the start event
      cudaEventCreate(&stop);    // create the stop event
      cudaEventRecord(start, 0); // start timing (stream 0)
    
      // What you want to measure
      cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
      cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
    
      cudaEventRecord(stop, 0);                 // stop timing
      cudaEventSynchronize(stop);               // wait until all device work preceding the
                                                // most recent cudaEventRecord(stop) has completed
    
      cudaEventElapsedTime(&time, start, stop); // store the elapsed time (ms) in time
    
      cudaEventDestroy(start);                  // release the events once the time has been read
      cudaEventDestroy(stop);
    

    EDIT: Additional information:

    "The kernel launch returns control to the CPU thread before it is finished. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is 'negligible'." (Credit: Robert Crovella)

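To make the point about the asynchronous kernel launch concrete, here is a minimal sketch that times the kernel and the device-to-host copy as two separate intervals, using three events instead of two. The original kernel is not shown in the question, so `busyKernel`, `Timings`, `timeKernelAndCopy`, and `CUDA_CHECK` are all hypothetical names introduced for illustration, and the launch configuration is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not part of the CUDA API).
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real kernel: it only keeps the GPU busy.
__global__ void busyKernel(unsigned int* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 3u + 1u;
        data[i] = v;
    }
}

struct Timings {
    float kernel_ms;  // time between start and the end of the kernel
    float copy_ms;    // time of the device-to-host copy alone
};

Timings timeKernelAndCopy(int n, int iters) {
    size_t bytes = (size_t)n * sizeof(unsigned int);
    unsigned int* h = (unsigned int*)calloc(n, sizeof(unsigned int));
    unsigned int* d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d, bytes));
    CUDA_CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, afterKernel, afterCopy;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&afterKernel));
    CUDA_CHECK(cudaEventCreate(&afterCopy));

    CUDA_CHECK(cudaEventRecord(start, 0));
    busyKernel<<<(n + 255) / 256, 256>>>(d, n, iters);  // returns immediately
    CUDA_CHECK(cudaGetLastError());
    // This event completes only once the kernel has finished on the GPU,
    // so the copy interval below does not include kernel execution time.
    CUDA_CHECK(cudaEventRecord(afterKernel, 0));
    CUDA_CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaEventRecord(afterCopy, 0));
    CUDA_CHECK(cudaEventSynchronize(afterCopy));

    Timings t;
    CUDA_CHECK(cudaEventElapsedTime(&t.kernel_ms, start, afterKernel));
    CUDA_CHECK(cudaEventElapsedTime(&t.copy_ms, afterKernel, afterCopy));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(afterKernel));
    CUDA_CHECK(cudaEventDestroy(afterCopy));
    CUDA_CHECK(cudaFree(d));
    free(h);
    return t;
}
```

Calling, for example, `timeKernelAndCopy(1 << 20, 1000)` and then again with a larger `iters` should show `kernel_ms` growing while `copy_ms` stays roughly constant, which is the behaviour the quoted explanation predicts once the two intervals are separated. With a host timer such as clock(), the same separation would require a cudaDeviceSynchronize() call after the kernel launch.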