nvidia

CUDA: GPUDirect on GeForce GTX 690

匆匆过客 submitted on 2019-12-11 04:52:52
Question: The GeForce GTX 690 (from vendors like Zotac and EVGA) can be used for CUDA programming, much like a Tesla K10. Question: Does the GeForce GTX 690 support GPUDirect? Specifically: if I were to use two GTX 690 cards, I would have 4 GPUs (two GPUs within each card). If I connect both GTX 690 cards to the same PCIe switch, will GPUDirect work well for communication between any pair of the 4 GPUs? Thanks. Answer 1: According to the requirements stated here, it is necessary to have Tesla-series GPUs. So …
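
A minimal sketch (not from the original thread) of how one could ask the driver whether peer-to-peer access, the mechanism GPUDirect P2P relies on, is reported between every pair of visible devices; the 4-GPU count and the printed wording are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);                            // e.g. 4 with two GTX 690 cards installed
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, a, b); // driver-reported P2P capability
            printf("GPU %d -> GPU %d : peer access %s\n", a, b,
                   canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}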

Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA

点点圈 submitted on 2019-12-11 04:25:34
Question: I am looking for a fast way to reduce multiple blocks of equal length that are arranged in a big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size k, so the size of the whole array is N*k. What I am doing is calling the kernel N times; each time it computes the reduction of one subarray, as follows. I iterate over all the subarrays contained in the big vector: for(i=0;i<N;i++){ thrust::device_vector< float > Vec …
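
A common alternative to launching one reduction per subarray is a single thrust::reduce_by_key pass over the whole N*k vector. The sketch below assumes the data already lives in a device_vector named data and that N and k are as described above; the names reduce_blocks, block_id and sums are made up for illustration.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a flat element index to the index of the subarray it belongs to.
struct block_id {
    int k;
    __host__ __device__ int operator()(int i) const { return i / k; }
};

void reduce_blocks(const thrust::device_vector<float>& data,
                   thrust::device_vector<float>& sums, int N, int k) {
    sums.resize(N);
    auto keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), block_id{k});
    // One pass over all N*k elements; runs of equal keys define the segments.
    thrust::reduce_by_key(keys, keys + N * k,
                          data.begin(),
                          thrust::make_discard_iterator(),   // segment ids are not needed
                          sums.begin());
}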

efficiency of CUDA Scalar and SIMD video instructions

﹥>﹥吖頭↗ submitted on 2019-12-11 04:15:21
Question: The throughput of the SIMD video instructions is lower than that of 32-bit integer arithmetic: on SM 2.0 (scalar-instruction-only versions) it is 2 times lower, and on SM 3.0 it is 6 times lower. In which cases is it suitable to use them? Answer 1: If your data is already packed in a format that is handled natively by a SIMD video instruction, then it would require multiple steps to unpack it so that it can be handled by an ordinary instruction. Furthermore, the throughput of a SIMD video instruction should also …
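
For illustration (this is not part of the original answer): the byte-wise SIMD video intrinsics only pay off when the data is already packed four 8-bit lanes to a 32-bit word, as in the absolute-difference style kernel below. __vabsdiffu4 is a real CUDA device intrinsic, while the kernel name and parameters are assumed.

// Per-byte absolute differences on packed data; each unsigned int holds
// four unsigned 8-bit values, so no unpacking step is needed.
__global__ void absdiff_packed(const unsigned int* a, const unsigned int* b,
                               unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __vabsdiffu4(a[i], b[i]);   // |a - b| computed independently in each byte lane
    }
}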

GPU Memory not freeing itself after CUDA script execution

流过昼夜 submitted on 2019-12-11 03:58:37
Question: I am having an issue with my graphics card retaining memory after the execution of a CUDA script (even with the use of cudaFree()). On boot the total used memory is about 128 MB, but after the script runs it runs out of memory mid-execution. nvidia-smi: +------------------------------------------------------+ | NVIDIA-SMI 340.29 Driver Version: 340.29 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC …
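
As a general cleanup sketch (not necessarily the fix accepted in this thread): freeing every allocation and then destroying the context with cudaDeviceReset() before the process exits lets the driver reclaim the memory; if nvidia-smi still shows memory in use afterwards, another (possibly hung) process is usually the holder. The 128 MB size below is just an illustrative allocation.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float* d_buf = NULL;
    cudaMalloc(&d_buf, 128u << 20);          // hypothetical 128 MB allocation
    // ... kernel launches would go here ...
    cudaFree(d_buf);                         // release the allocation explicitly
    cudaError_t err = cudaDeviceReset();     // tear down the context so nothing lingers
    if (err != cudaSuccess)
        fprintf(stderr, "cudaDeviceReset failed: %s\n", cudaGetErrorString(err));
    return 0;
}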

Cuda char* variable assignment

天涯浪子 submitted on 2019-12-11 03:19:35
Question: This is a follow-up question to the selected answer in this post: Output of cuda program is not what was expected. While the function below works: __global__ void setVal(char **word) { char *myWord = word[(blockIdx.y * gridDim.x) + blockIdx.x]; myWord[0] = 'H'; myWord[1] = 'e'; myWord[2] = 'l'; myWord[3] = 'l'; myWord[4] = 'o'; } why doesn't this one? __global__ void setVal(char **word) { char *myWord = word[(blockIdx.y * gridDim.x) + blockIdx.x]; myWord = "Hello\0"; } Answer 1: You should …
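
A hedged sketch of the distinction (the quoted answer is cut off above, so this is not its verbatim text): the assignment myWord = "Hello\0" only repoints the local pointer variable at a string literal and never writes into the buffer that word[...] refers to, which is why the element-by-element version works. Copying the bytes, for example with the loop below, modifies the pointed-to buffer itself (it assumes that buffer holds at least 6 bytes).

__global__ void setVal(char **word) {
    char *myWord = word[(blockIdx.y * gridDim.x) + blockIdx.x];
    const char src[] = "Hello";          // per-thread local array, 5 chars + '\0'
    for (int i = 0; i < 6; ++i)
        myWord[i] = src[i];              // writes land in the buffer word[...] points to
}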

Can CUDA code damage a GPU?

大兔子大兔子 submitted on 2019-12-11 02:56:51
Question: While testing a piece of CUDA code containing a memory bug, my screen froze. After rebooting I can no longer detect the graphics card. Is it possible that my code physically damaged the card? This happened under Ubuntu 14.04. I don't know the model of the card, since I cannot detect it, but I remember it is a fairly new one. Answer 1: Thanks to all the comments, I solved the problem. I will list the actions that I took. I'm not sure whether all of them had an effect, but eventually the problem got …

OpenCL: Correct results on CPU not on GPU: how to manage memory correctly?

你。 submitted on 2019-12-11 02:54:17
Question: __kernel void CKmix(__global short* MCL, __global short* MPCL, __global short *C, int S, int B) { unsigned int i=get_global_id(0); unsigned int ii=get_global_id(1); MCL[i]+=MPCL[B*ii+i+C[ii]+S]; } The kernel seems OK, it compiles successfully, and I have obtained the correct results using the CPU as the device, but that was when I had the program release and recreate my memory objects each time the kernel is called, which for my testing purposes is about 16000 times. The code I am posting is …
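
A general host-side sketch of the buffer-reuse idea (not the poster's actual code; ctx, queue, kernel, the host arrays and the byte sizes are assumed to have been created by the usual clCreateContext / clCreateCommandQueue / clBuildProgram sequence): the cl_mem objects are created once, bound once, reused for every enqueue, and released only after the last iteration.

cl_int err;
cl_mem dMCL  = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, bytesMCL,  hostMCL,  &err);
cl_mem dMPCL = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, bytesMPCL, hostMPCL, &err);
cl_mem dC    = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, bytesC,    hostC,    &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dMCL);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dMPCL);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);

for (int iter = 0; iter < 16000; ++iter) {
    clSetKernelArg(kernel, 3, sizeof(cl_int), &S);     // per-call scalar arguments
    clSetKernelArg(kernel, 4, sizeof(cl_int), &B);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);
}
clEnqueueReadBuffer(queue, dMCL, CL_TRUE, 0, bytesMCL, hostMCL, 0, NULL, NULL);    // blocking read at the end

clReleaseMemObject(dMCL);      // release once, after all 16000 launches
clReleaseMemObject(dMPCL);
clReleaseMemObject(dC);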

glReadPixels() burns up all CPU cycles of a single core

纵饮孤独 submitted on 2019-12-11 02:49:54
Question: I have an SDL2 app with an OpenGL window, and it is well behaved: when it runs, the app is synchronized with my 60 Hz display, and I see 12% CPU usage for the app. So far so good. But when I add 3D picking by reading a single (!) depth value from the depth buffer (after drawing), the following happens: FPS stays at 60, but CPU usage for the main thread goes to 100%. If I don't do the glReadPixels, the CPU usage drops back to 12% again. Why does reading a single value from the depth buffer cause the …
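
One mitigation that is often suggested for this symptom (a sketch, not necessarily this thread's accepted answer) is to route the readback through a pixel buffer object, so glReadPixels returns immediately and the value is fetched a frame later; pbo, x and y are assumed to be picking state owned by the application.

// One-time setup: a 4-byte pack buffer for a single depth value.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, sizeof(GLfloat), NULL, GL_STREAM_READ);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// After drawing: with a PACK buffer bound, the last argument is a byte offset,
// so this call does not wait for the GPU to finish the frame.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(x, y, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, (void*)0);

// A frame or two later, when the transfer has most likely completed:
GLfloat depth = 0.0f;
GLfloat* p = (GLfloat*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (p) { depth = *p; glUnmapBuffer(GL_PIXEL_PACK_BUFFER); }
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);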

Fixing GLSL shaders for Nvidia and AMD

时间秒杀一切 submitted on 2019-12-10 23:00:28
Question: I am having problems getting my GLSL shaders to work on both AMD and Nvidia hardware. I am not looking for help fixing a particular shader, but for how to avoid these problems in general. Is it possible to check whether a shader will compile on AMD/Nvidia drivers without running the application on a machine with the respective hardware and actually trying it? I know that, in the end, testing is the only way to be sure, but during development I would like to at least avoid the obvious problems.

How to use make_transform_iterator() with counting_iterator<> and execution_policy in Thrust?

江枫思渺然 submitted on 2019-12-10 21:21:36
Question: I am trying to compile this code with MSVS 2012, CUDA 5.5, and Thrust 1.7: #include <iostream> #include <thrust/iterator/counting_iterator.h> #include <thrust/iterator/transform_iterator.h> #include <thrust/find.h> #include <thrust/execution_policy.h> struct is_odd { __host__ __device__ bool operator()(uint64_t &x) { return x & 1; } }; int main() { thrust::counting_iterator<uint64_t> first(0); thrust::counting_iterator<uint64_t> last = first + 100; auto iter = thrust::find(thrust::device, thrust::make …
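
For reference, a hedged completion of the pattern (the excerpt above is cut off, so the transform functor, the use of find_if and the final lines are assumptions rather than the original poster's code): taking the functor argument by value, or by const reference, avoids trying to bind a non-const reference to the temporary that the counting/transform iterators produce, which is one classic cause of compile errors with this combination.

#include <cstdint>
#include <iostream>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/find.h>
#include <thrust/execution_policy.h>

struct square {
    __host__ __device__ uint64_t operator()(uint64_t x) const { return x * x; }
};

struct is_odd {
    // By value: the iterators yield temporaries, not references into memory.
    __host__ __device__ bool operator()(uint64_t x) const { return x & 1; }
};

int main() {
    thrust::counting_iterator<uint64_t> first(0);
    thrust::counting_iterator<uint64_t> last = first + 100;
    auto begin = thrust::make_transform_iterator(first, square());
    auto end   = thrust::make_transform_iterator(last,  square());
    auto iter  = thrust::find_if(thrust::device, begin, end, is_odd());
    std::cout << "first odd transformed value at offset " << (iter - begin) << std::endl;
    return 0;
}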