nvidia

Trying to mix OpenCL with CUDA in NVIDIA's SDK template

Submitted by 你。 on 2019-12-14 02:38:44
Question: I have been having a tough time setting up an experiment where I allocate memory on the device with CUDA, take that device pointer, use it in OpenCL, and return the results. I want to see if this is possible. I had a tough time getting a CUDA project to work, so I just used NVIDIA's template project from their SDK. In the makefile I added -lOpenCL to the libs section of common.mk. Everything is fine when I do that, but when I add #include <CL/cl.h> to template.cu so I can …
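
A minimal sketch of the intended setup, assuming the SDK headers cooperate: the CUDA runtime and the OpenCL host API used from one nvcc-compiled translation unit (the file name mix.cu and the 1024-element buffer are made up for illustration). If nvcc rejects CL/cl.h, a common fallback is to move the OpenCL host calls into a plain .cpp file compiled by the host compiler and linked with -lOpenCL. Note that a raw cudaMalloc pointer cannot simply be handed to clCreateBuffer; neither API defines that kind of interop, so at best this confirms the two runtimes coexist in one binary.

// Minimal sketch (file name mix.cu is hypothetical): CUDA runtime and OpenCL
// host API in one translation unit. Build e.g. with: nvcc mix.cu -lOpenCL -o mix
#include <cstdio>
#include <cuda_runtime.h>
#include <CL/cl.h>

int main() {
    // CUDA side: allocate a small device buffer.
    float* d_buf = nullptr;
    if (cudaMalloc(&d_buf, 1024 * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // OpenCL side: enumerate platforms, which only proves the header and the
    // -lOpenCL link work next to nvcc-compiled code.
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::printf("OpenCL platforms found: %u\n", numPlatforms);

    cudaFree(d_buf);
    return 0;
}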

TensorFlow can find the right cuDNN in one Python file but fails in another

Submitted by 时光毁灭记忆、已成空白 on 2019-12-14 02:35:04
Question: I am trying to use the TensorFlow GPU version to train and test my deep learning model, but here comes the problem. When I train my model in one Python file, things go well and tensorflow-gpu is used properly. Then I save my model as a pretrained one in graph.pb format and try to reuse it in another Python file, and I get the following error message: E tensorflow/stream_executor/cuda/cuda_dnn.cc:363] Loaded runtime CuDNN library: 7.1.4 but source was compiled with: 7.2.1. CuDNN library major …

NVRTC and __device__ functions

Submitted by ≯℡__Kan透↙ on 2019-12-14 02:24:06
Question: I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I have identified a specific __device__ function whose performance can be strongly improved by removing all global memory accesses. Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__ one), in order to "override" an existing function?

Answer 1: I am pretty sure the really short answer is no. Although CUDA has dynamic/JIT device linker support, …
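
Since NVRTC compiles whole source strings to PTX rather than patching a single function into an already-built module, the usual workaround is to JIT the complete kernel with the specialized __device__ function baked into the source, then load it through the driver API. A minimal sketch along those lines (the kernel source, names, and the compute_50 architecture flag are illustrative assumptions, not from the question):

// Minimal sketch (kernel source and names are illustrative): JIT-compile a whole
// kernel containing the specialized __device__ function, then load the PTX with
// the driver API. Link with -lnvrtc -lcuda.
#include <cstdio>
#include <string>
#include <nvrtc.h>
#include <cuda.h>

static const char* kSource = R"(
__device__ float specialized(float x) { return x * 2.0f; }   // "overriding" body
extern "C" __global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = specialized(data[i]);
}
)";

int main() {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSource, "jit.cu", 0, nullptr, nullptr);
    const char* opts[] = { "--gpu-architecture=compute_50" };
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "%s\n", log.c_str());
        return 1;
    }
    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadDataEx(&mod, ptx.c_str(), 0, nullptr, nullptr);
    CUfunction fn; cuModuleGetFunction(&fn, mod, "kernel");
    // ... allocate device memory and launch fn with cuLaunchKernel ...
    return 0;
}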

Forcing hardware-accelerated rendering

Submitted by 為{幸葍}努か on 2019-12-14 00:29:55
Question: I have an OpenGL library written in C++ that is used from a C# application through C++/CLI adapters. My problem is that if the application runs on laptops with NVIDIA Optimus technology, it will not use hardware acceleration and will fail. I have tried to use the information found in NVIDIA's document http://developer.download.nvidia.com/devzone/devcenter/gamegraphics/files/OptimusRenderingPolicies.pdf about linking libs to my C++ DLL and exporting NvOptimusEnablement from my OpenGL …
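
For reference, the export described in the linked Optimus document looks like the sketch below; the AMD counterpart export is included as a commonly cited companion hint, not something the question mentions. Whether the driver honours the export when it lives in a native DLL loaded by a managed .exe (the asker's setup) is exactly the open question, since the document discusses the executable case.

// Minimal sketch: the driver hints for selecting the discrete GPU. A value of
// 0x00000001 for NvOptimusEnablement requests the high-performance NVIDIA GPU;
// AmdPowerXpressRequestHighPerformance is the analogous AMD export.
#include <windows.h>

extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}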

Where do I find the strings for error codes in OpenCL (NVIDIA)?

Submitted by 陌路散爱 on 2019-12-13 21:22:44
Question: Running a simple OpenCL matrix multiplication code on an NVIDIA GPU, I get error code -30. I want to know what this code means. I am sure the string corresponding to this code (int) must be stored somewhere. Can someone help me interpret this code? Once I know what this error means I can debug my code easily.

Answer 1: From the CLEW library: const char* clewErrorString(cl_int error) { static const char* strings[] = { // Error Codes "CL_SUCCESS" // 0 , "CL_DEVICE_NOT_FOUND" // -1 , …
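
Error -30 is CL_INVALID_VALUE in CL/cl.h. OpenCL has no strerror-style helper of its own, so the codes are normally mapped by hand, which is what clewErrorString does. A minimal hand-rolled sketch in the same spirit (only a handful of codes shown, values taken from the CL_* definitions in the header):

// Minimal sketch of a hand-rolled error-to-string helper in the spirit of
// clewErrorString(); only a handful of codes are shown. The numeric values come
// from the CL_* definitions in <CL/cl.h>; error -30 is CL_INVALID_VALUE.
#include <CL/cl.h>

const char* clErrorString(cl_int err) {
    switch (err) {
        case CL_SUCCESS:                       return "CL_SUCCESS";                       //   0
        case CL_DEVICE_NOT_FOUND:              return "CL_DEVICE_NOT_FOUND";              //  -1
        case CL_MEM_OBJECT_ALLOCATION_FAILURE: return "CL_MEM_OBJECT_ALLOCATION_FAILURE"; //  -4
        case CL_OUT_OF_RESOURCES:              return "CL_OUT_OF_RESOURCES";              //  -5
        case CL_INVALID_VALUE:                 return "CL_INVALID_VALUE";                 // -30
        case CL_INVALID_KERNEL_ARGS:           return "CL_INVALID_KERNEL_ARGS";           // -52
        default:                               return "unknown OpenCL error code";
    }
}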

Why is the OpenCV GPU module performing faster than VisionWorks?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-13 15:26:45
Question: I have tried several functions of the OpenCV gpu module and compared their behavior with the equivalent VisionWorks immediate-mode code. Surprisingly, in all circumstances the OpenCV gpu module performs significantly faster than VisionWorks. For example, a level-4 Gaussian pyramid implemented manually using OpenCV: #include <iostream> #include <stdio.h> #include <queue> /* OPENCV RELATED */ #include <cv.h> #include <highgui.h> #include "opencv2/highgui/highgui.hpp" #include "opencv2/imgproc …
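
For context, a level-4 Gaussian pyramid with the old OpenCV 2.x gpu module is just repeated pyrDown calls; a minimal sketch, assuming a grayscale input whose file name (input.png) is made up:

// Minimal sketch (OpenCV 2.x gpu module; the file name input.png is made up):
// a 4-level Gaussian pyramid built on the GPU with repeated pyrDown calls,
// each level blurring and downsampling the previous one by a factor of 2.
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>

int main() {
    cv::Mat host = cv::imread("input.png", CV_LOAD_IMAGE_GRAYSCALE);
    std::vector<cv::gpu::GpuMat> pyr(4);
    pyr[0].upload(host);                       // level 0: the original image
    for (int i = 1; i < 4; ++i)
        cv::gpu::pyrDown(pyr[i - 1], pyr[i]);  // level i: half the size of level i-1
    return 0;
}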

“Global Load Efficiency” over 100%

Submitted by 荒凉一梦 on 2019-12-13 15:22:22
Question: I have a CUDA program in which the threads of a block read elements of a long array over several iterations, and the memory accesses are almost fully coalesced. When I profile, Global Load Efficiency is over 100% (between 119% and 187% depending on the input). The description of Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does that mean I'm hitting the L2 cache a lot and my memory accesses are benefiting from it? My GPU is a GeForce GTX 780 …

CUDA unknown error

Submitted by 好久不见. on 2019-12-13 08:28:40
Question: I'm trying to run mainSift.cpp from CudaSift on an NVIDIA Tesla M2090. First of all, as explained in this question, I had to change sm_35 to sm_20 in the CMakeLists.txt. Unfortunately, now this error is returned: checkMsg() CUDA error: LaplaceMulti() execution failed in file </ghome/rzhengac/Downloads/CudaSift/cudaSiftH.cu>, line 318 : unknown error. And this is the LaplaceMulti code: double LaplaceMulti(cudaTextureObject_t texObj, CudaImage *results, float baseBlur, float diffScale, float …
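
An "unknown error" reported at a checkMsg() call is often an asynchronous failure left over from an earlier launch (for example a kernel built for the wrong sm_xx architecture) rather than a fault in the named function. A minimal sketch of the usual checking pattern that separates the two cases; CUDA_CHECK and myKernel are placeholders, not CudaSift code:

// Minimal sketch of the usual post-launch checking pattern; CUDA_CHECK and
// myKernel are placeholders, not CudaSift code. Checking cudaGetLastError()
// right after the launch catches configuration/architecture problems, while
// cudaDeviceSynchronize() surfaces errors raised while the kernel executed.
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err));                    \
    } while (0)

__global__ void myKernel(float* data) { data[threadIdx.x] *= 2.0f; }

void launch(float* d_data) {
    myKernel<<<1, 32>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // launch/arch errors (e.g. wrong sm_xx)
    CUDA_CHECK(cudaDeviceSynchronize());  // errors from the kernel's execution
}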

CUDA: Mapping Error using CUSPARSE csrmv() routine

Submitted by 醉酒当歌 on 2019-12-13 08:27:11
Question: I'm currently trying to use the cuSPARSE library to speed up an HPCG implementation. However, it appears I'm making some kind of mistake during device data allocation. This is the code segment that results in CUSPARSE_STATUS_MAPPING_ERROR: int HPC_sparsemv( CRS_Matrix *A_crs_d, FP * x_d, FP * y_d) { FP alpha = 1.0f; FP beta = 0.0f; FP* vals = A_crs_d->vals; int* inds = A_crs_d->col_ind; int* row_ptr = A_crs_d->row_ptr; /*generate Matrix descriptor for SparseMV computation*/ …
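
A mapping error from csrmv usually means one of the array arguments is not a valid device pointer, a classic case being a struct copied to the device whose member pointers still reference host memory. For comparison, a minimal sketch of the legacy cusparseScsrmv call (deprecated and later removed from the toolkit) with every array explicitly a cudaMalloc'd device pointer; the function and parameter names below are illustrative, not from the question's code:

// Minimal sketch (legacy cuSPARSE API, deprecated and later removed) of
// Scsrmv: y = alpha * A * x + beta * y. Every array argument must be a device
// pointer from cudaMalloc; a device-side struct whose members still point at
// host memory is a typical cause of CUSPARSE_STATUS_MAPPING_ERROR.
#include <cuda_runtime.h>
#include <cusparse_v2.h>

void csr_spmv(cusparseHandle_t handle, int m, int n, int nnz,
              const float* d_vals, const int* d_rowPtr, const int* d_colInd,
              const float* d_x, float* d_y) {
    float alpha = 1.0f, beta = 0.0f;
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_vals, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
}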

Laptop dual video cards - how to programmatically detect and/or choose which one is used

Submitted by 折月煮酒 on 2019-12-13 06:40:08
Question: We're developing software that uses DirectX for 3D rendering on Windows 7 and later machines, with 64-bit C#/.NET code. We've observed that a number of the newer Dell laptops we test on have dual video cards: integrated Intel HD 4600 graphics alongside a faster NVIDIA Quadro card, for example. By default, out of the box, the Intel graphics are used by the DirectX application, presumably to preserve battery life. But the performance is noticeably worse than …
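
On the native side, the available adapters can at least be enumerated through DXGI to confirm which GPUs DirectX sees; the managed C# side can do the equivalent through a DXGI wrapper such as SharpDX. A minimal C++ sketch (vendor IDs: 0x10DE is NVIDIA, 0x8086 is Intel):

// Minimal sketch: list the display adapters DXGI can see. Picking a specific
// GPU is then a matter of passing the chosen adapter to D3D11CreateDevice, or
// of exporting NvOptimusEnablement as in the Optimus entry above.
#include <cwchar>
#include <dxgi.h>
#pragma comment(lib, "dxgi.lib")

int main() {
    IDXGIFactory1* factory = nullptr;
    if (FAILED(CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory)))
        return 1;

    IDXGIAdapter1* adapter = nullptr;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        // VendorId 0x10DE = NVIDIA, 0x8086 = Intel.
        std::wprintf(L"Adapter %u: %ls (vendor 0x%04X)\n", i, desc.Description, desc.VendorId);
        adapter->Release();
    }
    factory->Release();
    return 0;
}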