nvidia

The behavior of the __CUDA_ARCH__ macro

ぃ、小莉子 submitted on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly one code path for the current device. However, if __CUDA_ARCH__ is used within device code, it will generate a different code path for each of the devices specified in the compilation options (/arch). Can anyone confirm this is correct? Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled
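
A minimal sketch (my own code, not from the question, assuming an nvcc build) of the behavior the answer describes: __CUDA_ARCH__ is only defined during nvcc's device-code compilation passes, and it takes a different value for each architecture being compiled, so the #if branch below is chosen per architecture; in the host-code pass the macro is not defined at all.

#include <cstdio>

__global__ void which_arch()
{
    // __CUDA_ARCH__ is defined only in the device-code pass, once per
    // target architecture, e.g. 520 for sm_52 and 700 for sm_70.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
    printf("device path for sm_70 or newer: __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#elif defined(__CUDA_ARCH__)
    printf("device path for an older architecture: __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    // Host code: __CUDA_ARCH__ is undefined here, so there is no
    // per-architecture branching on the host side.
    which_arch<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Compiling with, for example, nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_70,code=sm_70 runs the device-code pass once per architecture, each time with its own __CUDA_ARCH__ value, and the driver selects the matching version at run time.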

Possible to virtualize NVIDIA GeForce GTX 1070 Graphics Card for Distributed Tensorflow?

会有一股神秘感。 submitted on 2021-01-27 10:46:30
Question: I am running Windows 10 on an Intel Core i7-8700 CPU with 16 GB RAM, a 1 TB HDD, and a dedicated NVIDIA GeForce GTX 1070 graphics card. I plan to launch 3 Ubuntu instances hosted by my Windows 10 PC. The Ubuntu instances will be running Distributed TensorFlow (tensorflow-gpu) code that will use the GPU for training a neural network. (To mention, I have already tried the setup on Windows but failed.) Q. Can my NVIDIA GPU be virtualized among those virtual machines or not? If YES, then is there any further

Tensorflow: Setting allow_growth to true still allocates memory on all my GPUs

◇◆丶佛笑我妖孽 submitted on 2021-01-23 11:09:09
Question: I have several GPUs, but I only want to use one GPU for my training. I am using the following options: config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True) config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: Despite setting/using all these options, all of my GPUs allocate memory and #processes = #GPUs. How can I prevent this from happening? Note: I do not want to set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES, since I want

What does #pragma unroll do exactly? Does it affect the number of threads?

孤人 submitted on 2020-11-30 02:38:02
Question: I'm new to CUDA and I can't understand loop unrolling. I've written a piece of code to understand the technique: __global__ void kernel(float *b, int size) { int tid = blockDim.x * blockIdx.x + threadIdx.x; #pragma unroll for(int i=0;i<size;i++) b[i]=i; } Above is my kernel function. In main I call it like below: int main() { float * a; //host array float * b; //device array int size=100; a=(float*)malloc(size*sizeof(float)); cudaMalloc((float**)&b,size); cudaMemcpy(b, a, size,
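
To illustrate what the question is asking about (this is my own sketch, not the asker's code): #pragma unroll is a hint that tells the compiler to replicate the loop body inside each thread, so it changes the instructions generated per thread, not how many threads or blocks are launched. Full unrolling also requires a trip count known at compile time, which is not the case for a runtime parameter like size above.

// Sketch of a typical #pragma unroll use: the trip count (4) is a
// compile-time constant, so the compiler can fully unroll the inner loop.
// Assumes "in" holds 4*n floats and "out" holds n floats.
__global__ void sum4(float *out, const float *in, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
#pragma unroll
    for (int k = 0; k < 4; ++k)      // body replicated 4 times per thread
        acc += in[tid * 4 + k];
    out[tid] = acc;
}

Whatever <<<blocks, threads>>> configuration this kernel is launched with is completely unaffected by the pragma.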

Can 1 CUDA core process more than 1 floating-point instruction per clock (Maxwell)?

橙三吉。 submitted on 2020-11-27 02:00:23
Question: In the List of Nvidia GPUs, GeForce 900 Series, it is written that: "Single precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed." For example, for the GeForce GTX 970 we can calculate the performance: 1664 cores * 1050 MHz * 2 = 3 494 GFLOPS peak (3 494 400 MFLOPS). This is the value shown in the column "Processing Power (peak) GFLOPS - Single Precision". But why must we multiply by 2? It is written at: http://devblogs.nvidia.com/parallelforall
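
The commonly cited reason for the factor of 2 is the fused multiply-add (FMA) instruction: each CUDA core can issue one FMA per clock, and an FMA is counted as two floating-point operations (a multiply plus an add), so 1664 cores × 1.050 GHz × 2 FLOPs/clock ≈ 3494.4 GFLOPS. A minimal CUDA sketch (my own, for illustration) of the operation being counted:

// One fused multiply-add: the GPU executes it as a single FFMA instruction
// per clock per CUDA core, but peak-FLOPS figures count it as two
// floating-point operations (one multiply + one add).
__global__ void fma_example(float *out, const float *a, const float *b,
                            const float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);   // a*b + c: 1 instruction, 2 FLOPs
}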
