nvidia

Why do we not have access to device memory on the host side?

Submitted by 血红的双手 on 2019-12-13 06:21:41
Question: I asked an earlier question, "Is memory allocated using cudaMalloc() accessible by the host or not?". Things are much clearer to me now, but I am still wondering why it is not possible to access the device pointer on the host. My understanding is that the CUDA driver takes care of memory allocation inside GPU DRAM, so this information (the first address of the allocated memory on the device) could be conveyed to the OS running on the host. Then it should be possible to access this device pointer, i.e. the
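
A minimal sketch of the distinction in play (illustrative code, not from the thread): the address returned by cudaMalloc() refers to the GPU's address space, so the host never dereferences it directly and instead moves data through explicit copies.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int *d_val;                    // holds a device address, not a host address
    cudaMalloc(&d_val, sizeof(int));

    // *d_val = 7;                 // undefined: host dereference of a device pointer

    int h_val = 7;
    cudaMemcpy(d_val, &h_val, sizeof(int), cudaMemcpyHostToDevice); // supported path in
    cudaMemcpy(&h_val, d_val, sizeof(int), cudaMemcpyDeviceToHost); // supported path out
    printf("h_val = %d\n", h_val);

    cudaFree(d_val);
    return 0;
}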

Reading updated memory from another CUDA stream

Submitted by 笑着哭i on 2019-12-13 05:06:19
Question: I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following.

#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

#define FLAGCLEAR 0
#define FLAGSET 1

using namespace std;

__global__ void set_flag(int *flag)
{
    *flag = FLAGSET;

    // Wait for flag to reset.
    while (*flag == FLAGSET);
}

__global__ void read_flag(int *flag)
{
    // wait for the flag to set.
    while (*flag != FLAGSET);

    // Clear it for next time.
    *flag = FLAGCLEAR;
}

int
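
For this pattern to stand a chance, the two kernels have to execute concurrently, and the flag has to be re-read from memory on every loop iteration. A minimal sketch of that arrangement (my assumption about the intended fix, not the excerpted answer; it also requires a device that supports concurrent kernel execution):

#include <cuda_runtime.h>

#define FLAGCLEAR 0
#define FLAGSET   1

// 'volatile' stops the compiler from caching the flag in a register.
__global__ void set_flag(volatile int *flag)
{
    *flag = FLAGSET;
    while (*flag == FLAGSET);   // spin until the reader clears it
}

__global__ void read_flag(volatile int *flag)
{
    while (*flag != FLAGSET);   // spin until the writer sets it
    *flag = FLAGCLEAR;
}

int main()
{
    int *flag;
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));   // start in the FLAGCLEAR state

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Separate non-default streams, so both kernels can be resident at the
    // same time; issuing both into one stream would serialize them and hang.
    read_flag<<<1, 1, 0, s1>>>(flag);
    set_flag<<<1, 1, 0, s2>>>(flag);

    cudaDeviceSynchronize();
    cudaFree(flag);
    return 0;
}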

Generate Index using CUDA-C

Submitted by 只谈情不闲聊 on 2019-12-13 04:31:50
Question: I am trying to generate the set of indices below. I have a CUDA grid that consists of 20 blocks (blockIdx: from 0 to 19), with each block subdivided into 4 sub-blocks (sub-block idx: 0, 1, 2 and 3). I am trying to generate an index pattern like this, from threadIdx (tid): SubBlockIdxA (SBA), SubBlockIdxB (SBB), BlockIdxA (BA), BlockIdxB (BB).

          Required            Obtained
tid   SBA SBB  BA  BB     SBA SBB  BA  BB
  0     0   1   0   0       0   1   0   0
  1     1   0   0   1       1   0   0   1
  2     0   1   1   1       0   1   1   1
  3     1   0   1   2       1   0   1   2
  4     0   1   2   2       0   1   2   2
  5     1   0   2   3       1   0   2   3
  6     0   1   3   3       0   1   3   3
  7     1   0   3
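
Reading the rows above, the pattern appears to have a closed form: SBA = tid % 2, SBB = (tid + 1) % 2, BA = tid / 2, BB = (tid + 1) / 2. A minimal kernel sketch of that inference (mine, drawn only from the visible rows; all names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Closed-form index generation matching the visible rows of the table.
__global__ void gen_indices(int *sba, int *sbb, int *ba, int *bb)
{
    int tid = threadIdx.x;
    sba[tid] = tid % 2;         // SBA: 0,1,0,1,...
    sbb[tid] = (tid + 1) % 2;   // SBB: 1,0,1,0,...
    ba[tid]  = tid / 2;         // BA:  0,0,1,1,2,2,...
    bb[tid]  = (tid + 1) / 2;   // BB:  0,1,1,2,2,3,...
}

int main()
{
    const int n = 8;
    int *sba, *sbb, *ba, *bb;
    cudaMallocManaged(&sba, n * sizeof(int));
    cudaMallocManaged(&sbb, n * sizeof(int));
    cudaMallocManaged(&ba,  n * sizeof(int));
    cudaMallocManaged(&bb,  n * sizeof(int));

    gen_indices<<<1, n>>>(sba, sbb, ba, bb);
    cudaDeviceSynchronize();

    for (int tid = 0; tid < n; ++tid)
        printf("%3d  %4d %3d %3d %3d\n", tid, sba[tid], sbb[tid], ba[tid], bb[tid]);
    return 0;
}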

OpenCL errors on long-running tasks

Submitted by 风格不统一 on 2019-12-13 03:56:07
Question: I'm running a long-running kernel on an NVIDIA Quadro 6000 device. The kernel involves a loop with tens of thousands of iterations. When I ran the kernel, after 2 seconds the screen went black, Windows restarted the GPU driver, and clFinish returned an error. So I got myself a second GPU card just for display, and now the 2-second timeout does not apply. The kernel computed for 50 seconds and then there were these errors (lines prefixed by "GPU ERROR" are errors printed by clCreateContext

Keras multi_gpu_model causes system to crash

Submitted by 主宰稳场 on 2019-12-13 03:33:32
Question: I am trying to train a rather large LSTM on a large dataset and have 4 GPUs to distribute the load. If I try to train on just one of them (any of them; I've tried each) it functions correctly, but after adding the multi_gpu_model code it crashes my entire system when I try to run it. Here is my multi-GPU code:

batch_size = 8
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(len(inputData[0]), len(inputData[0][0]))))
model.add(LSTM(256, return_sequences=True))
model.add

Unable to run rootless containers via Nvidia runtime

Submitted by 試著忘記壹切 on 2019-12-13 03:15:27
Question: I am totally new to Docker and NVIDIA. I am trying to install an NVIDIA platform called Clara on a Unix server. When I try to run rootless containers via the NVIDIA runtime, I get the error below (upon executing the 2nd line of code). Can you please help me with it? I have already created a config file, placed it under the "config" folder, and updated this path in the 'Taskfile.yml' file. Moreover, through a Google search I found a few threads indicating that it could be due to cgroups. However

How to reduce nonconsecutive segments of numbers in an array with Thrust

Submitted by 喜你入骨 on 2019-12-13 03:06:03
Question: I have a 1D array "A" which is composed of many arrays "a", as pictured. I'm implementing code to sum up non-consecutive segments (sum up the numbers in the same-colored segments of each array "a" in "A"), as pictured. Any ideas on how to do that efficiently with Thrust? Thank you very much. Note: the pictures represent only one array "a"; the big array "A" contains many arrays "a". Answer 1: In the general case, where the ordering of the data and grouping by segments is not known in advance, the
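
The general case the answer opens with is typically handled by making equal segment keys contiguous and then reducing each run. A minimal Thrust sketch under that assumption (the per-element integer keys standing in for the segment colors are made up for illustration):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <iostream>

int main()
{
    // Values of "A", with one key per element identifying its segment (color).
    int h_vals[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int h_keys[] = {0, 0, 1, 1, 0, 0, 1, 1};   // segments 0 and 1 are nonconsecutive

    thrust::device_vector<int> vals(h_vals, h_vals + 8);
    thrust::device_vector<int> keys(h_keys, h_keys + 8);

    // Make equal keys contiguous, then sum each contiguous run of keys.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    thrust::device_vector<int> out_keys(2), out_sums(2);
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), out_sums.begin());

    std::cout << out_sums[0] << " " << out_sums[1] << std::endl;  // 14 22
    return 0;
}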

location of cudaEventRecord and overlapping ops from different streams

Submitted by 有些话、适合烂在心里 on 2019-12-13 02:45:10
Question: I have two tasks. Each of them performs a copy to device (D), a kernel run (R), and a copy to host (H). I am overlapping the copy to device of task 2 (D2) with the kernel run of task 1 (R1). In addition, I am overlapping the kernel run of task 2 (R2) with the copy to host of task 1 (H1). I also record the start and stop times of the D, R, and H ops of each task using cudaEventRecord. I have a GeForce GT 555M, CUDA 4.1, and Fedora 16. I have three scenarios: Scenario 1: I use one stream for each task. I place start/stop
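
A minimal sketch of the two-stream layout being described, with events bracketing D2 as one example of per-op timing (the kernel, sizes, and names are illustrative, not the asker's code):

#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h1, *h2, *d1, *d2;
    cudaMallocHost(&h1, bytes);   // pinned host memory, required for async copies
    cudaMallocHost(&h2, bytes);
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t d2_start, d2_stop;
    cudaEventCreate(&d2_start);
    cudaEventCreate(&d2_stop);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);  // D1
    work<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);                // R1
    cudaEventRecord(d2_start, s2);
    cudaMemcpyAsync(d2, h2, bytes, cudaMemcpyHostToDevice, s2);  // D2 overlaps R1
    cudaEventRecord(d2_stop, s2);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);  // H1
    work<<<(n + 255) / 256, 256, 0, s2>>>(d2, n);                // R2 overlaps H1
    cudaMemcpyAsync(h2, d2, bytes, cudaMemcpyDeviceToHost, s2);  // H2

    cudaDeviceSynchronize();
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, d2_start, d2_stop);                // elapsed time of D2
    return 0;
}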

Tensorflow: GPU Acceleration only happens after first run

Submitted by 懵懂的女人 on 2019-12-12 18:15:39
Question: I've installed CUDA and cuDNN on my machine (Ubuntu 16.04) alongside tensorflow-gpu. Versions used: CUDA 10.0, cuDNN 7.6, Python 3.6, TensorFlow 1.14. This is the output from nvidia-smi, showing the video card configuration:

| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute

Tensorflow CUDA GTX 1070 import error

Submitted by 爷,独闯天下 on 2019-12-12 05:58:58
Question: I'm trying to install TensorFlow with CUDA support. Here are my specs:

NVIDIA GTX 1070
CUDA 7.5
cuDNN v5.0

I have installed TensorFlow via the pip installation -- so I'm picturing your answer being to install from source, but I want to make sure there isn't a quick fix. The error is:

volcart@volcart-Precision-Tower-7910:~$ python
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import