My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem.
//#include <stdio.h>
//#include <stdlib.h>
//#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cutil.h>
__global__ void test_kernel(float *d_data) {
int tid = blockDim.x*blockIdx.x + threadIdx.x;
for (int i=0;i<10000;++i) {
d_data[tid] = float(i*2.2);
d_data[tid] += 3.3;
}
}
int main(int argc, char* argv[])
{
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device = 0; //SELECT GPU HERE
cudaSetDevice(device);
cudaEvent_t start, stop;
unsigned int num_vals = 200000000;
float *h_data = new float[num_vals];
for (int i=0;i<num_vals;++i) {
h_data[i] = float(i);
}
float *d_data = NULL;
float malloc_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_data, sizeof(float)*num_vals);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
float mem_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
float kernel_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
test_kernel<<<1000,256>>>(d_data);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
printf("cudaMalloc took %f ms\n",malloc_timer);
printf("Copy to the GPU took %f ms\n",mem_timer);
printf("Test Kernel took %f ms\n",kernel_timer);
cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost);
delete[] h_data;
return 0;
}
The results are
GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms
GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took 663.182251 ms Test Kernel took 326.710785 ms
GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms
GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms
As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.
This was a driver issue. Updating to the latest driver fixed it
This may be an issue with your pci bus, try swapping the cards into different slots to see if the problem persists. If this is an issue, copy all your data onto the gtx295 via the faster slot and use sli top copy it across to the other (slow pci bus) gpu.
If you can utilized the faster video card's gddr to load, then you can do a device device tansfer at much MUCH higher bandwidth, that might help eliminate the issue also. Also, check your bandwidth with NVidia's bandwidth testing to get some physical results and test.
Good luck!
Are you running in a dual processor setup? There is a bug in the current Tylersburg chipsets such that the bandwidth of the path x86 (0) to GPU (1) is slower than the direct path from x86 (0) to GPU (0). Intel should release a new version to fix this bug. Try locking your test process to a specific CPU using taskset and see what results you get.
regards Mark
来源:https://stackoverflow.com/questions/2560446/cuda-memory-copy-to-gpu-1-is-slower-in-multi-gpu