Question
I'm new to the GPU world and just installed CUDA to write some programs. I played with the thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s for the host-to-device part on my not-bad desktop. How can that be?
Environment: Visual Studio 2012, CUDA 5.0, GTX 760, Intel i7, Windows 7 x64
GPU bandwidth test:
It is supposed to reach at least 11 GB/s of transfer speed from host to device or vice versa, but it didn't!
Here's the test program:
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N (32<<22)

int main(void)
{
    using namespace std;
    cout << "GPU bandwidth test via thrust, data size: " << (sizeof(double)*N) / 1000000000.0 << " Gbytes" << endl;
    cout << "============program start==========" << endl;

    int now = time(0);
    cout << "Initializing h_vec...";
    thrust::host_vector<double> h_vec(N, 0.0);
    cout << "time spent: " << time(0)-now << " secs" << endl;

    now = time(0);
    cout << "Uploading data to GPU...";
    thrust::device_vector<double> d_vec = h_vec;
    cout << "time spent: " << time(0)-now << " secs" << endl;

    now = time(0);
    cout << "Downloading data to h_vec...";
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    cout << "time spent: " << time(0)-now << " secs" << endl << endl;

    system("PAUSE");
    return 0;
}
Program output:
Download speed: less than 1 sec, which makes sense compared to the nominal 11 GB/s.
Upload speed: 1.07374 GB / 32 secs ≈ 33.5 MB/s, which doesn't make sense at all.
Does anyone know the reason? Or is this just the way thrust is?
Thanks!!
Answer 1:
Your comparison has several flaws, some of which are covered in the comments.
- You need to eliminate any allocation effects. You can do this by doing some "warm-up" transfers first.
- You need to eliminate any "start-up" effects. You can do this by doing some "warm-up" transfers first.
- When comparing the data, remember that bandwidthTest is using a PINNED memory allocation, which thrust does not use. Therefore the thrust data transfer rate will be slower. This typically contributes about a 2x factor (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch. (A sketch comparing pinned and pageable transfers follows this list.)
- Your choice of timing functions might not be the best. cudaEvents is pretty reliable for timing CUDA operations.
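To see the pinned-vs-pageable gap from the third point directly, here is a minimal sketch that times one cudaMemcpy from an ordinary malloc'ed buffer and one from a buffer allocated with cudaMallocHost. The h2d_seconds helper and the 256 MB size are illustrative, not from the original post:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time a single host-to-device cudaMemcpy with cudaEvents; returns seconds.
static float h2d_seconds(void *dst_d, const void *src_h, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst_d, src_h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f;
}

int main()
{
    const size_t bytes = (size_t)256 << 20;   // 256 MB, illustrative size
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    void *pageable = malloc(bytes);           // ordinary (pageable) host memory
    void *pinned;
    cudaMallocHost(&pinned, bytes);           // page-locked (pinned) host memory
    memset(pageable, 0, bytes);               // touch the pages so they are committed
    memset(pinned, 0, bytes);

    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);  // warm-up transfer

    float tp = h2d_seconds(d_buf, pageable, bytes);
    float tl = h2d_seconds(d_buf, pinned, bytes);
    printf("pageable H2D: %.1f MB/s\n", bytes / tp / 1048576.0f);
    printf("pinned   H2D: %.1f MB/s\n", bytes / tl / 1048576.0f);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}

On a typical PCIe Gen2/Gen3 desktop the pinned figure should come out roughly 2x the pageable one, which is the factor mentioned above.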
Here is some code that does proper timing:
$ cat t213.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>

#define DSIZE ((1UL<<20)*32)

int main(){

    thrust::device_vector<int> d_data(DSIZE);
    thrust::host_vector<int> h_data(DSIZE);
    float et;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // warm-up transfers in both directions to absorb allocation and start-up costs
    thrust::fill(h_data.begin(), h_data.end(), 1);
    thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
    std::cout << "warm up iteration " << d_data[0] << std::endl;
    thrust::fill(d_data.begin(), d_data.end(), 2);
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
    std::cout << "warm up iteration " << h_data[0] << std::endl;

    // timed host-to-device copy, bracketed by cudaEvents
    thrust::fill(h_data.begin(), h_data.end(), 3);
    cudaEventRecord(start);
    thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    std::cout << "host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
    std::cout << "apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

    // timed device-to-host copy
    thrust::fill(d_data.begin(), d_data.end(), 4);
    cudaEventRecord(start);
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    std::cout << "device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
    std::cout << "apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

    std::cout << "finished" << std::endl;
    return 0;
}
I compile with (I have a PCIe Gen2 system with a cc2.0 device):
$ nvcc -O3 -arch=sm_20 -o t213 t213.cu
When I run it I get the following results:
$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$
This looks correct to me, because bandwidthTest on my system would report about 6 GB/s in either direction, as I have a PCIe Gen2 system. Since thrust uses pageable, not pinned, memory, I get about half of that bandwidth, i.e. about 3 GB/s, and thrust is reporting about 2.5 GB/s.
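If you want thrust itself to transfer at the pinned rate, a host_vector can be given a pinned allocator instead of the default one. This is only a minimal sketch: it assumes your thrust release ships the experimental pinned-allocator header, whose location and namespace have moved between thrust versions, so check your installation:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
// Assumption: this experimental header exists in your thrust release.
#include <thrust/system/cuda/experimental/pinned_allocator.h>

int main()
{
    const size_t n = (size_t)1 << 24;
    // host storage backed by cudaMallocHost (pinned) instead of the default pageable allocator
    thrust::host_vector<double, thrust::system::cuda::experimental::pinned_allocator<double> > h_vec(n, 0.0);
    thrust::device_vector<double> d_vec = h_vec;   // this H2D copy now moves pinned memory
    return 0;
}

With the host vector pinned, the same d_vec = h_vec copy should approach the PINNED numbers that bandwidthTest reports by default.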
For comparison, here is the bandwidth test on my system, using pageable memory:
$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1

$
Source: https://stackoverflow.com/questions/17987045/cuda-why-thrust-is-so-slow-on-uploading-data-to-gpu