nvidia

How to build CUDA JIT caches for all available kernels in TensorFlow programmatically?

Posted by 夏天 on 2019-12-25 08:15:41
Question: I encountered the "first-run slow-down" problem with GTX 1080 cards and nvidia-docker, as discussed in this question. I'm using the TensorFlow build from its official pip package and a custom Docker image based on nvidia-docker's Ubuntu 16.04 base image. How do I make TensorFlow load (and build JIT caches for) all registered CUDA kernels programmatically in a Dockerfile, rather than manually building TensorFlow with the TF_CUDA_COMPUTE_CAPABILITIES environment variable? Answer 1: There seems to be no
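The slow first run happens because the CUDA driver JIT-compiles the PTX shipped in the pip package for the card's compute capability and stores the result in its compute cache, so later runs start quickly. A minimal CUDA sketch of that mechanism, independent of TensorFlow (the file name and build flags are illustrative):

    // warmup.cu -- built with PTX only, so the driver must JIT on first launch:
    //   nvcc -gencode arch=compute_61,code=compute_61 warmup.cu -o warmup
    // The first run populates the driver cache (~/.nv/ComputeCache by default;
    // movable via CUDA_CACHE_PATH, resizable via CUDA_CACHE_MAXSIZE).
    #include <cstdio>

    __global__ void warmup() {}  // trivial kernel; launching it forces the PTX->SASS JIT

    int main() {
        warmup<<<1, 1>>>();
        cudaError_t err = cudaDeviceSynchronize();
        printf("warm-up %s\n", err == cudaSuccess ? "ok" : cudaGetErrorString(err));
        return err == cudaSuccess ? 0 : 1;
    }

Running such a warm-up step once during the docker build, with the cache directory preserved in the image, would prime the cache for every container started from it.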

OpenCvSharp installed using NuGet Package Manager not detecting a CUDA device

Posted by 旧时模样 on 2019-12-25 05:44:15
Question: I am trying to use the GPU through OpenCvSharp. I installed OpenCvSharp using the NuGet Package Manager in Microsoft Visual Studio 2013, and I have already included these lines: using OpenCvSharp; using OpenCvSharp.CPlusPlus; using OpenCvSharp.CPlusPlus.Gpu; But when I check the device count with int count = Cv2Gpu.GetCudaEnabledDeviceCount(); Console.WriteLine("The GPU Device count is " + count.ToString()); it always returns 0. Now it also says that if OpenCV is not compiled with CUDA support, this method returns 0.
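One way to narrow this down (a hedged sketch, not part of the OpenCvSharp API) is to ask the CUDA runtime directly how many devices it sees. If this native check already reports 0, the problem is the driver or CUDA installation; if it reports 1 or more while OpenCvSharp still returns 0, the prebuilt OpenCV binaries in the NuGet package were most likely compiled without CUDA, and a custom CUDA-enabled OpenCV build is needed.

    // devicecount.cu -- checks whether the CUDA runtime can see a GPU,
    // independently of OpenCV / OpenCvSharp.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("CUDA device count: %d\n", count);
        return 0;
    }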

Can Tensorflow be installed alongside Theano?

Posted by 删除回忆录丶 on 2019-12-25 04:30:16
Question: I'm trying to install TensorFlow alongside Theano on an Nvidia Tesla K80. I'm working with CUDA 7.5 and following the instructions given here. Theano by itself works well, but as soon as I install TensorFlow, either from source following the instructions or using pip install, nvidia-smi as well as Theano stops working. More specifically, nvidia-smi hangs indefinitely, whereas Theano just refuses to run in GPU mode. I'm also using the latest version of cuDNN, v4. Does TensorFlow have known issues

Java OpenGL EXCEPTION_ACCESS_VIOLATION on glDrawArrays only on NVIDIA

Posted by 我只是一个虾纸丫 on 2019-12-25 03:45:51
Question: I'm working on a game in Java using LWJGL and its OpenGL implementation. I never had any problems until I shared the code with a colleague who uses NVIDIA instead of AMD, and suddenly it crashes on a line that works fine on AMD, and only at that point in the code. That is the weirdest part, because I use the same method to create the VBOs from .obj files. I even tried it with the same file, and it still crashes there and nowhere else. Could it be maybe a wrong set flag or
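A common cause of EXCEPTION_ACCESS_VIOLATION inside glDrawArrays is a vertex attribute array that is enabled but has no buffer bound to it, or whose size/stride makes the driver read past the end of the VBO; NVIDIA's driver tends to fault on the stray read where AMD's happens to tolerate it. A hedged C-style sketch of a defensive draw path (the single-attribute layout is an assumption; the calls are standard OpenGL):

    // Defensive VBO draw: every enabled attribute must point into a bound
    // buffer large enough for 'vertexCount' vertices.
    #include <GL/glew.h>  // or any loader providing GL 2.0+ entry points

    void drawMesh(GLuint vbo, GLsizei vertexCount) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Attribute 0: 3 floats per vertex, tightly packed.
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);

        glDrawArrays(GL_TRIANGLES, 0, vertexCount);

        // Disable again so a later draw cannot dereference this pointer by accident.
        glDisableVertexAttribArray(0);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }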

Understanding cudaMemcpyAsync and stream behaviour

Posted by 雨燕双飞 on 2019-12-24 13:04:18
Question: I have the simple code shown below, which does nothing but copy some data from the host to the device using streams. But after running nvprof I am confused about whether cudaMemcpyAsync is really asynchronous, and about my understanding of streams.

#include <stdio.h>
#define NUM_STREAMS 4

cudaError_t memcpyUsingStreams(float *fDest, float *fSrc, int iBytes,
                               cudaMemcpyKind eDirection, cudaStream_t *pCuStream)
{
    int iIndex = 0;
    cudaError_t cuError = cudaSuccess;
    int iOffset = 0;
    iOffset = (iBytes
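The usual explanation: cudaMemcpyAsync only runs asynchronously with respect to the host, and only overlaps across streams, when the host buffer is page-locked (pinned); with ordinary pageable memory the copy is staged through a driver buffer and shows up in nvprof much like a synchronous copy. A minimal sketch under that assumption (array size and stream count are illustrative):

    // Async copies only overlap when the host memory is pinned.
    #include <cstdio>
    #define NUM_STREAMS 4

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *hSrc = NULL, *dDst = NULL;
        cudaMallocHost((void**)&hSrc, bytes);  // pinned host memory: required for true async
        cudaMalloc((void**)&dDst, bytes);

        cudaStream_t streams[NUM_STREAMS];
        for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamCreate(&streams[i]);

        // One chunk per stream; the copies queue back-to-back and can
        // overlap with kernels issued to other streams.
        const size_t chunk = bytes / NUM_STREAMS;
        for (int i = 0; i < NUM_STREAMS; ++i) {
            size_t off = i * (chunk / sizeof(float));
            cudaMemcpyAsync(dDst + off, hSrc + off, chunk,
                            cudaMemcpyHostToDevice, streams[i]);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamDestroy(streams[i]);
        cudaFree(dDst);
        cudaFreeHost(hSrc);
        printf("done\n");
        return 0;
    }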

JOGL - monitor GPU memory

Posted by 可紊 on 2019-12-24 12:13:08
Question: I am looking for JOGL classes/methods/examples to retrieve the total available GPU memory and the currently available GPU memory. I know it can be done using OpenGL (JOGL Java docs). Answer 1: The link you posted uses NVIDIA proprietary extensions. However, given the way modern GPUs operate, it is of little use to know how much "memory" is left. Why? Because OpenGL has always operated on an abstract memory model: single data objects (textures, VBOs) may be too large to fit into the
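For reference, the proprietary extension in question is GL_NVX_gpu_memory_info. A hedged C-style sketch of the query (the enum values come from the extension spec, a current GL context is assumed, and the same constants can be passed to glGetIntegerv from JOGL):

    // GL_NVX_gpu_memory_info query (NVIDIA only; check the extension string first).
    // All values are reported in kilobytes.
    #include <stdio.h>
    #include <GL/gl.h>

    #define GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX   0x9048
    #define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049

    void printGpuMemoryInfo(void) {
        GLint totalKb = 0, availKb = 0;
        glGetIntegerv(GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, &totalKb);
        glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &availKb);
        printf("GPU memory: %d KiB total, %d KiB currently available\n", totalKb, availKb);
    }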

Counting FLOPS/GFLOPS in a program - CUDA

Posted by 孤人 on 2019-12-24 10:48:47
Question: I have already finished my application, which multiplies a CRS matrix by a vector (SpMV), and the only thing left to do is count the FLOPs it performs. In my opinion it's really hard to estimate the number of floating point operations for sparse matrix-vector multiplication, because the number of multiplies per row is really "jumpy", fluctuating from row to row. I have only tried measuring time using "cudaprof" (available in the ./CUDA/bin directory), and that works fine. Any suggestions and instruction pastes appreciated!
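The count need not depend on the row-length distribution at all: each stored nonzero contributes exactly one multiply and one add, so a CRS SpMV performs about 2 * nnz FLOPs, and GFLOPS = 2 * nnz / (seconds * 1e9). A hedged sketch with CUDA event timing (the nnz value is illustrative, and the kernel launch is elided):

    // FLOP count for SpMV: one multiply + one add per stored nonzero.
    #include <cstdio>

    int main() {
        long long nnz = 10000000LL;  // assumed: in CRS this is row_ptr[num_rows]

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // spmv_kernel<<<grid, block>>>(...);  // your SpMV launch goes here
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gflops = 2.0 * nnz / (ms * 1e6);  // 2*nnz FLOPs / ((ms/1e3) s * 1e9)
        printf("%.3f ms, %.2f GFLOPS\n", ms, gflops);
        return 0;
    }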

How to debug (GLSL) shaders using Nsight?

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-24 10:26:41
Question: How can I debug GLSL shaders using Nsight? I am using Nsight Visual Studio Edition 5.2, and I have also tried Nsight Visual Studio Edition 5.1; neither works. What I mean is that I've tried the following method and it doesn't work:
1. Open the Visual Studio project
2. Select "Nsight" from the menu and "Start Graphics Debugging"
3. Let the program run for a while
4. Press "Ctrl+Z"
5. Press "Space"
6. Go to "API Inspector" in Visual Studio
7. Select "Program" from the left side bar
8. Select a "Source" from "Linked Shader State"

nvprof events “fb_subp0_read_sectors” and “fb_subp1_read_sectors” do not report correct results

Posted by 杀马特。学长 韩版系。学妹 on 2019-12-24 08:49:07
Question: I tried to count the number of DRAM (global memory) accesses for a simple vector add kernel.

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + (N * blockDim.x);
    int i;
    for (i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}

Grid size = 180, block size = 128, size of array = 180 * 128 * N floats
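As a sanity check on those counters: fb_subp0_read_sectors and fb_subp1_read_sectors count 32-byte DRAM sectors, and this kernel reads each input array exactly once, so the expected total is roughly 2 * 180 * 128 * N * 4 bytes / 32 = 5760 * N sectors; the stores to C appear in the corresponding write-sector events instead. A tiny check, with the 32-byte sector size as the key assumption:

    // Expected DRAM read sectors for AddVectors (A and B each read once).
    #include <cstdio>

    int main() {
        long long grid = 180, block = 128, N = 16;       // N = 16 is an assumed example
        long long bytesRead = 2 * grid * block * N * 4;  // two float arrays, 4 B each
        long long sectors   = bytesRead / 32;            // 32-byte DRAM sectors
        printf("expected read sectors: %lld (= 5760 * N)\n", sectors);
        return 0;
    }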