nvidia

memset in CUDA that allows setting values within a kernel

空扰寡人 submitted on 2019-12-12 04:53:42
Question: I am making several cudaMemset calls in order to set my values to 0, as below:

```cuda
void allocateByte(char **gStoreR, const int byte) {
    char **cStoreR = (char **)malloc(N * sizeof(char *));
    for (int i = 0; i < N; i++) {
        char *c;
        cudaMalloc((void **)&c, byte * sizeof(char));
        cudaMemset(c, 0, byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
```

However, this is proving to be very slow. Is there a memset function on the GPU, as calling it from the CPU takes a lot of…
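One way to address this (a sketch, not the poster's code): the per-buffer cudaMemset calls and allocations can be collapsed into a single contiguous allocation that is cleared once from the host, or cleared from the device by a trivial kernel. The helper name allocateContiguous and the 256-thread launch below are my own placeholders.

```cuda
#include <cuda_runtime.h>

// Device-side memset: each thread clears one byte of the buffer.
__global__ void zeroKernel(char *buf, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = 0;
}

// One allocation and one memset for all N logical buffers, instead of
// N cudaMalloc + N cudaMemset calls (N and byte as in the question).
void allocateContiguous(char **dBuf, int N, int byte)
{
    size_t total = (size_t)N * byte;
    cudaMalloc((void **)dBuf, total);
    cudaMemset(*dBuf, 0, total);
    // Equivalent device-side alternative:
    // zeroKernel<<<(unsigned)((total + 255) / 256), 256>>>(*dBuf, total);
}
```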

Computing the mean of 2000 2D-arrays with CUDA C

末鹿安然 submitted on 2019-12-12 03:59:09
Question: I have 2000 2D arrays (each array is 1000x1000). I need to compute the mean of each one and put the results into one 2000-element vector. I tried doing that by calling the kernel once per 2D array, but that is naive; I want to do the computation all at once. What I have done so far is a kernel for one 2D array; I want a kernel that does this for all 2000 2D arrays in one launch.

```cuda
#include <stdio.h>
#include <cuda.h>
#include <time.h>

void init_mat(float *a, const int N, const int M);
void print_mat(float *a, const int N,…
```
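A possible single-launch approach (a sketch under my own assumptions, not the poster's init_mat/print_mat code): launch one block per matrix, let each thread stride over that matrix accumulating a partial sum, and finish with a shared-memory reduction so thread 0 writes the mean.

```cuda
#include <cuda_runtime.h>

__global__ void meanPerMatrix(const float *mats, float *means,
                              int rows, int cols)
{
    extern __shared__ float partial[];
    const int m = blockIdx.x;                     // one block per matrix
    const size_t count = (size_t)rows * cols;
    const float *mat = mats + (size_t)m * count;

    // Each thread accumulates a strided partial sum over this matrix.
    float sum = 0.0f;
    for (size_t i = threadIdx.x; i < count; i += blockDim.x)
        sum += mat[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        means[m] = partial[0] / (float)count;
}

// Hypothetical launch for 2000 matrices of 1000x1000 floats:
// meanPerMatrix<<<2000, 256, 256 * sizeof(float)>>>(d_mats, d_means, 1000, 1000);
```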

NVIDIA CUDA GPU computing questions

天大地大妈咪最大 submitted on 2019-12-12 03:54:00
Question: I installed tensorflow-gpu on Windows 10. I am trying a Keras training example to test GPU computing. All the CUDA libraries loaded successfully, but it shows the following:

```
Train on 60000 samples, validate on 10000 samples
Epoch 1/100
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 970M
major: 5 minor: 2
memoryClockRate (GHz) 1.038
pciBusID 0000:01:00.0
Total memory: 3.00GiB
Free…
```
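Independently of TensorFlow, a small CUDA program can confirm that the GTX 970M from the log is actually visible to the runtime; this is only a diagnostic sketch I am adding, not part of the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute %d.%d, %.2f GiB\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```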

CUDA Warps and Optimal Number of Threads Per Block

白昼怎懂夜的黑 submitted on 2019-12-12 03:26:23
Question: From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions: 1) If the SMX unit can work on 64 warps, that means there is a limit of 32 x 64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that…
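The numbers the question reasons about can be checked at runtime; the sketch below (my own example kernel, requiring CUDA 6.5 or newer for the occupancy API) prints the warp size, the warp limit per SMX, and how many blocks of a chosen size can be resident on one multiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }   // placeholder kernel, not from the question

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warpSize: %d, maxThreadsPerMultiProcessor: %d (= %d warps)\n",
           prop.warpSize, prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);

    int blockSize = 256;   // arbitrary example block size
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);
    printf("Resident blocks of %d threads per SM: %d (occupancy %.0f%%)\n",
           blockSize, blocksPerSM,
           100.0 * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}
```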

Information on current GPU Architectures

[亡魂溺海] submitted on 2019-12-12 02:27:48
Question: I have decided that my bachelor's thesis will be about general-purpose GPU computing and which problems are more suitable for it than others. I am also trying to find out whether there are any major differences between the current GPU architectures that may affect this. I am currently looking for scientific papers and/or information directly from the manufacturers about the current GPU architectures, but I can't seem to find anything that looks detailed enough. Therefore, I am hoping that…

How to install CUDA 8.0 with the latest version of TensorFlow (1.0) on an AWS p2.xlarge instance (AMI ami-edb11e8d) with up-to-date NVIDIA drivers (375.39)

梦想与她 submitted on 2019-12-12 02:18:43
Question: I have upgraded to TensorFlow 1.0 and installed CUDA 8.0 with cuDNN 5.1 and the NVIDIA drivers up to date (375.39). My NVIDIA hardware is on Amazon Web Services, a Tesla K80 on a p2.xlarge instance. My OS is 64-bit Linux. I get the following error message every time I run tf.Session():

```
[ec2-user@ip-172-31-7-96 CUDA]$ python
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright",…
```
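When tf.Session() fails on a setup like this, the usual suspects are a driver/runtime mismatch or CUDA libraries missing from LD_LIBRARY_PATH; a tiny CUDA program (a diagnostic sketch I am adding, not part of the question) can show what the runtime itself reports before involving TensorFlow.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);     // version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVer);   // version of the linked CUDA runtime
    printf("Driver API version:  %d\n", driverVer);
    printf("Runtime API version: %d\n", runtimeVer);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    printf("cudaGetDeviceCount: %s, %d device(s)\n",
           cudaGetErrorString(err), count);
    return 0;
}
```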

CUDA PTX register declaration and usage

本秂侑毒 submitted on 2019-12-12 01:46:34
Question: I am trying to reduce the number of registers used in my kernel, so I decided to try inline PTX. This is the kernel:

```cuda
#define Feedback(a, b, c, d, e) d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c)

__global__ void Test(unsigned long a, unsigned long b, unsigned long c,
                     unsigned long d, unsigned long e, unsigned long f,
                     unsigned long j, unsigned long h, unsigned long *res)
{
    res[0] = Feedback( a, b, c, d, e );
    res[1] = Feedback( b, c, d, e, f );
    res[2] = Feedback( c, d, e, f, j…
```
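For reference, this is the general shape of inline PTX with a locally declared temporary register; it computes only the first few terms of the Feedback macro, d ^ e ^ (a & c), purely as an illustration (the braces keep the .reg declaration scoped so the statement can be inlined more than once). It assumes 64-bit unsigned integers, as unsigned long is on 64-bit Linux.

```cuda
__device__ unsigned long long feedback_head(unsigned long long a,
                                            unsigned long long c,
                                            unsigned long long d,
                                            unsigned long long e)
{
    unsigned long long r;
    asm("{\n\t"
        ".reg .u64 t;\n\t"        // temporary PTX register
        "and.b64 t, %1, %2;\n\t"  // t = a & c
        "xor.b64 t, t, %3;\n\t"   // t ^= d
        "xor.b64 %0, t, %4;\n\t"  // r = t ^ e
        "}"
        : "=l"(r)
        : "l"(a), "l"(c), "l"(d), "l"(e));
    return r;
}
```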

Running an NVENC SDK sample fails because libnvidia-encode is missing

ぐ巨炮叔叔 submitted on 2019-12-11 21:26:56
Question: When I make the nvEncodeApp NVENC SDK sample on CentOS 6.4, I get this error: /usr/bin/ld: cannot find -lnvidia-encode. When I checked the Makefile, the path to this library was: -L/usr/lib64 -lnvidia-encode -ldl. I checked /usr/lib64, but there is no libnvidia-encode there. How does this library get added to that path, and what is this library? Using nvidia-smi should tell you that:

```
nvidia-smi
Tue Jul 16 20:19:20 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304…
```

Large for loop crashing on an NVIDIA GeForce GT 610

自作多情 submitted on 2019-12-11 16:46:27
Question: I have an OpenCL kernel with two nested loops. It works fine up to a certain number of iterations, but crashes when the number of iterations is increased. The loop essentially does not create any new data (i.e., there is no global memory overflow, etc.); it just iterates more times. What can I do to allow more iterations? Has anyone encountered this problem? Thanks a lot.

Answer 1: Are you running this on Windows? Windows has a watchdog timer mechanism that restarts the display driver if it…
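The standard workaround for watchdog-related crashes is to split the work into several shorter launches so that no single launch exceeds the driver's timeout. Below is a sketch of that pattern, written in CUDA for consistency with the rest of this page (the same idea applies to OpenCL with repeated clEnqueueNDRangeKernel calls); the kernel body and chunk size are placeholders of my own.

```cuda
#include <cuda_runtime.h>

__global__ void doChunk(float *data, long start, long end)
{
    // Placeholder work: process iterations [start, end) in a grid-stride loop.
    for (long i = start + blockIdx.x * (long)blockDim.x + threadIdx.x;
         i < end;
         i += (long)gridDim.x * blockDim.x) {
        data[i % 1024] += 1.0f;
    }
}

void runInChunks(float *dData, long totalIters)
{
    const long chunk = 1 << 20;   // tune so one launch stays well under the timeout
    for (long start = 0; start < totalIters; start += chunk) {
        long end = start + chunk;
        if (end > totalIters) end = totalIters;
        doChunk<<<256, 256>>>(dData, start, end);
        cudaDeviceSynchronize();  // each launch returns control to the driver
    }
}
```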

Why are NVIDIA Pascal GPUs slow at running CUDA kernels when using cudaMallocManaged?

安稳与你 submitted on 2019-12-11 15:53:25
Question: I was testing the new CUDA 8 along with a Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04. Here is the minimal code that reproduces the result:

CUDASample.cuh

```cuda
class CUDASample {
public:
    void AddOneToVector(std::vector<int> &in);
};
```

CUDASample.cu

```cuda
__global__ static void CUDAKernelAddOneToVector(int *data)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const…
```
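If the slowdown comes from Pascal's on-demand page migration of cudaMallocManaged memory (which the truncated code does not show conclusively), a common mitigation is to prefetch the managed allocation to the GPU before the kernel runs; the function below is my own sketch, not the poster's AddOneToVector.

```cuda
#include <cuda_runtime.h>

void prefetchExample(size_t n)
{
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));

    int device = 0;
    cudaGetDevice(&device);
    // Move the pages to the GPU ahead of time (CUDA 8+, Pascal or newer).
    cudaMemPrefetchAsync(data, n * sizeof(int), device, 0);

    // ... launch the kernel that touches `data` here ...

    // Optionally bring the pages back before reading on the CPU.
    cudaMemPrefetchAsync(data, n * sizeof(int), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
    cudaFree(data);
}
```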