nvidia

The behavior of the __CUDA_ARCH__ macro

ぃ、小莉子 submitted on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly one code path for the current device. However, if __CUDA_ARCH__ is used within device code, it will generate a different code path for each of the devices specified in the compilation options (/arch). Can anyone confirm this is correct? Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled
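
A minimal sketch (my own code, not from the question, assuming an nvcc build) of the behavior the answer describes: __CUDA_ARCH__ is only defined during nvcc's device-code compilation passes, and it takes a different value for each architecture being compiled, so the #if branch below is chosen per architecture; in the host-code pass the macro is not defined at all.

#include <cstdio>

__global__ void which_arch()
{
    // __CUDA_ARCH__ is defined only in the device-code pass, once per
    // target architecture, e.g. 520 for sm_52 and 700 for sm_70.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
    printf("device path for sm_70 or newer: __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#elif defined(__CUDA_ARCH__)
    printf("device path for an older architecture: __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    // Host code: __CUDA_ARCH__ is undefined here, so there is no
    // per-architecture branching on the host side.
    which_arch<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Compiling with, for example, nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_70,code=sm_70 runs the device-code pass once per architecture, each time with its own __CUDA_ARCH__ value, and the driver selects the matching version at run time.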

Possible to virtualize NVIDIA GeForce GTX 1070 Graphics Card for Distributed Tensorflow?

会有一股神秘感。 submitted on 2021-01-27 10:46:30
Question: I am running Windows 10 on an Intel Core i7-8700 CPU with 16 GB RAM, a 1 TB HDD, and a dedicated NVIDIA GeForce GTX 1070 graphics card. I plan to launch 3 Ubuntu instances hosted by my Windows 10 PC. The Ubuntu instances will be running Distributed TensorFlow (tensorflow-gpu) code that will use the GPU for training a neural network. (To mention, I have already tried the setup on Windows but failed.) Q. Can my NVIDIA GPU be virtualized among those virtual machines or not? If YES, then is there any further

Tensorflow: Setting allow_growth to true still allocates memory on all my GPUs

◇◆丶佛笑我妖孽 submitted on 2021-01-23 11:09:09
Question: I have several GPUs, but I only want to use one GPU for my training. I am using the following options: config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True) config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: Despite setting/using all these options, all of my GPUs allocate memory and #processes = #GPUs. How can I prevent this from happening? Note: I do not want to set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES, since I want

What does #pragma unroll do exactly? Does it affect the number of threads?

孤人 submitted on 2020-11-30 02:38:02
Question: I'm new to CUDA and I can't understand loop unrolling. I've written a piece of code to understand the technique: __global__ void kernel(float *b, int size) { int tid = blockDim.x * blockIdx.x + threadIdx.x; #pragma unroll for(int i=0;i<size;i++) b[i]=i; } Above is my kernel function. In main I call it like below: int main() { float * a; //host array float * b; //device array int size=100; a=(float*)malloc(size*sizeof(float)); cudaMalloc((float**)&b,size); cudaMemcpy(b, a, size,
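
To illustrate what the question is asking about (this is my own sketch, not the asker's code): #pragma unroll is a hint that tells the compiler to replicate the loop body inside each thread, so it changes the instructions generated per thread, not how many threads or blocks are launched. Full unrolling also requires a trip count known at compile time, which is not the case for a runtime parameter like size above.

// Sketch of a typical #pragma unroll use: the trip count (4) is a
// compile-time constant, so the compiler can fully unroll the inner loop.
// Assumes "in" holds 4*n floats and "out" holds n floats.
__global__ void sum4(float *out, const float *in, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
#pragma unroll
    for (int k = 0; k < 4; ++k)      // body replicated 4 times per thread
        acc += in[tid * 4 + k];
    out[tid] = acc;
}

Whatever <<<blocks, threads>>> configuration this kernel is launched with is completely unaffected by the pragma.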

Can 1 CUDA core process more than 1 floating-point instruction per clock (Maxwell)?

橙三吉。 submitted on 2020-11-27 02:00:23
Question: In the List of Nvidia GPUs, GeForce 900 Series, it is written that: "Single precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed." For example, for the GeForce GTX 970 we can calculate the performance: 1664 cores * 1050 MHz * 2 = 3 494 GFLOPS peak (3 494 400 MFLOPS). This is the value shown in the column "Processing Power (peak) GFLOPS - Single Precision". But why must we multiply by 2? It is written at: http://devblogs.nvidia.com/parallelforall
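
The commonly cited reason for the factor of 2 is the fused multiply-add (FMA) instruction: each CUDA core can issue one FMA per clock, and an FMA is counted as two floating-point operations (a multiply plus an add), so 1664 cores × 1.050 GHz × 2 FLOPs/clock ≈ 3494.4 GFLOPS. A minimal CUDA sketch (my own, for illustration) of the operation being counted:

// One fused multiply-add: the GPU executes it as a single FFMA instruction
// per clock per CUDA core, but peak-FLOPS figures count it as two
// floating-point operations (one multiply + one add).
__global__ void fma_example(float *out, const float *a, const float *b,
                            const float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);   // a*b + c: 1 instruction, 2 FLOPs
}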
