nvidia

Changing the arch argument in CUDA makes me use more registers

Submitted by 戏子无情 on 2019-12-11 15:04:36
Question: I have been writing a kernel on my Tesla K20m. When I compile the software with -Xptxas=-v I obtain the following results: ptxas info : 0 bytes gmem ptxas info : Compiling entry function '_Z9searchKMPPciPhiPiS1_' for 'sm_10' ptxas info : Used 8 registers, 80 bytes smem, 8 bytes cmem[1]. As you can see, only 8 registers are used; however, if I pass the argument -arch=sm_35, my kernel's execution time rises dramatically and so does the number of registers used, and I am wondering why nvcc
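For context (this is not from the question itself): per-kernel register usage can be capped, which is a common way to experiment with the effect described above. A minimal sketch with a hypothetical kernel name and body, assuming a Kepler-class target such as sm_35:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks the
// compiler to keep register usage low enough for that occupancy, which usually
// reduces registers per thread; nvcc -maxrregcount=N imposes a similar cap
// file-wide. Kernel name and body are placeholders.
__global__ void __launch_bounds__(256, 4) exampleKernel(int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = idx;   // trivial body; only the annotation matters here
    }
}

Recompiling with -Xptxas=-v then reports the register count actually allocated for the sm_35 build.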

NVIDIA Debug Manager for Android NDK Eclipse Plugin

Submitted by 我的梦境 on 2019-12-11 12:45:33
Question: Can anyone tell me where I can get NVDebugMgrForAndroidNDK? It is the "NVIDIA Debug Manager for Android NDK Eclipse Plugin". Thanks. Answer 1: It's available as part of NVIDIA's Tegra Android Development Pack. Answer 2: I installed the 4.0r2 version but didn't see NVDebugMgrForAndroidNDK. Source: https://stackoverflow.com/questions/9305560/nvidia-debug-manager-for-android-ndk-eclipse-plugin

NVIDIA CUDA SDK Examples Compilation Unsupported Architecture 'compute_20'

Submitted by 旧城冷巷雨未停 on 2019-12-11 12:32:28
Question: On compilation of the CUDA SDK, I'm getting nvcc fatal : Unsupported gpu architecture 'compute_20'. My toolkit is 2.3 on a shared system (i.e. I can't really upgrade) and the driver version is also 2.3, running on 4 Tesla C1060s. If it helps, the problem occurs in radixsort. It appears that a few people online have had this problem, but I haven't found anywhere that actually gives a solution. Answer 1: I believe compute_20 targets Fermi hardware, which you do not have. Also, CUDA 2.3

On plan reuse in cuFFT

Submitted by こ雲淡風輕ζ on 2019-12-11 11:03:49
Question: This may seem like a simple question, but cuFFT usage is not very clear to me. My question is: which of the following implementations is correct? 1) // called in a loop cufftPlan3d(plan1, x, y, z); cufftexec(plan1, data1); cufftexec(plan1, data2); cufftexec(plan1, data3); destroyplan(plan1) 2) init() // called only once in the application { cufftPlan3d(plan1, x, y, z); } exec() // called many times with changing data, size remains the same { cufftexec(plan1, data1); cufftexec(plan1
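Mapped onto the real cuFFT API, option 2's idea (create the plan once and reuse it while the transform size stays the same) looks roughly like the sketch below; buffer allocation, data transfer, and error checking are assumed:

#include <cufft.h>

// Minimal sketch: one 3D complex-to-complex plan created once and reused for
// several same-sized device buffers, then destroyed. d_data1..3 are assumed
// to be device pointers of the right size.
void runFFTs(cufftComplex *d_data1, cufftComplex *d_data2, cufftComplex *d_data3,
             int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);            // plan created once

    cufftExecC2C(plan, d_data1, d_data1, CUFFT_FORWARD);  // reuse the same plan
    cufftExecC2C(plan, d_data2, d_data2, CUFFT_FORWARD);
    cufftExecC2C(plan, d_data3, d_data3, CUFFT_FORWARD);

    cufftDestroy(plan);                                   // release plan resources
}

Plan creation is comparatively expensive, which is why reusing one plan across executions of the same size is generally preferred over re-planning inside a loop.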

Does the NVIDIA CUDA warp scheduler yield?

Submitted by 不想你离开。 on 2019-12-11 10:58:17
Question: I have gone through the CUDA programming guide, but it is still not clear to me whether a warp will yield in favor of another ready-to-execute warp. Any explanation or pointer, please? If yes, under what conditions does a warp yield? Answer 1: Yes, the on-chip scheduler interleaves the execution of warps. The scheduling policy is intentionally left unspecified because it may change; NVIDIA does not want CUDA developers to write code that relies on the current scheduling policies but fails on newer
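To illustrate why the unspecified policy matters, here is a small hypothetical kernel (not from the question) whose progress depends on the scheduler switching between warps, which is exactly the kind of code the answer discourages:

// Anti-pattern sketch: warp 0 spins until warp 1 writes a flag. Whether and
// when warp 0 gets descheduled in favour of warp 1 is up to the unspecified
// warp scheduling policy, so this behaviour must not be relied on.
__global__ void interWarpSpin(volatile int *flag)
{
    int warp = threadIdx.x / warpSize;
    if (warp == 0) {
        while (*flag == 0) { /* busy-wait for warp 1 */ }
    } else if (warp == 1) {
        *flag = 1;   // release warp 0
    }
}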

keras multiple_gpu_model causes “Can't pickle module object” error

Submitted by 为君一笑 on 2019-12-11 10:28:16
Question: This is a follow-up to this question. I am trying to utilize 8 GPUs for training and am using multiple_gpu_model from Keras. I specified a batch size of 128, which will be split among the 8 GPUs, resulting in 16 per GPU. Now, with this configuration, I get the following error: Train on 6120 samples, validate on 323 samples Epoch 1/100 6120/6120 [==============================] - 42s 7ms/step - loss: 0.0996 - mean_iou: 0.6919 - val_loss: 0.0969 - val_mean_iou: 0.7198 Epoch 00001: val_loss

nvEncodeApp builds successfully, but when running it: NVENC error at CNVEncoder.cpp:1282 code=15 (invalid struct version was used) "nvStatus"

Submitted by [亡魂溺海] on 2019-12-11 09:28:00
Question: I build nvEncodeApp successfully, but when I run it my output is like this: ./nvEncoder -infile=HeavyHandIdiot.3sec.yuv -outfile=outh.264 -width=1080 -height=1080 > NVEncode configuration parameters for Encoder[0] > GPU Device ID = 0 > Input File = HeavyHandIdiot.3sec.yuv > Output File = outh.264 > Frames [000--01] = 0 frames > Multi-View Codec = No > Width,Height = [1080,1080] > Video Output Codec = 4 - H.264 Codec > Average Bitrate = 0 (bps/sec) > Peak Bitrate = 0 (bps/sec) > BufferSize = 0 >
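The "invalid struct version was used" status typically means an NVENC API struct was passed without its version field set. The sketch below shows the usual pattern from nvEncodeAPI.h; it is a hypothetical fragment, not the actual code around CNVEncoder.cpp:1282:

#include <cstring>
#include "nvEncodeAPI.h"

// Every NVENC input struct carries a version field that must be set to the
// matching *_VER constant before the call, otherwise the API reports an
// invalid struct version.
void initEncoderParams(NV_ENC_INITIALIZE_PARAMS &initParams, NV_ENC_CONFIG &encodeConfig)
{
    memset(&initParams, 0, sizeof(initParams));
    initParams.version = NV_ENC_INITIALIZE_PARAMS_VER;

    memset(&encodeConfig, 0, sizeof(encodeConfig));
    encodeConfig.version = NV_ENC_CONFIG_VER;

    initParams.encodeConfig = &encodeConfig;
}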

Can I use GPUDirect v2 Peer-to-Peer communication between two Quadro K1100M or two GeForce GT 745M?

Submitted by 依然范特西╮ on 2019-12-11 08:46:04
Question: Can I use GPUDirect v2 Peer-to-Peer communication on a single PCIe bus: between two mobile NVIDIA Quadro K1100M cards, or between two mobile NVIDIA GeForce GT 745M cards? Answer 1: In general, if you want to find out whether GPUDirect Peer-to-Peer is supported between two GPUs, you can run the simple P2P CUDA sample code, or in your own code you can test the availability with the cudaDeviceCanAccessPeer runtime API call. Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU
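A minimal sketch of that runtime check, assuming two CUDA devices numbered 0 and 1 and omitting error handling:

#include <cstdio>
#include <cuda_runtime.h>

// Query whether device 0 can access device 1's memory and, if so, enable
// peer access from device 0's context.
int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("Device 0 can access device 1: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        // cudaMemcpyPeer and direct peer loads/stores are now possible
    }
    return 0;
}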

How can I tell if H2O 3.11.0.266 is running with GPUs?

Submitted by 半世苍凉 on 2019-12-11 05:37:59
Question: I've installed H2O 3.11.0.266 on Ubuntu 16.04 with CUDA 8.0 and libcudnn.so.5.1.10, so I believe H2O should be able to find my GPUs. However, when I start up h2o.init() in Python, I do not see evidence that it is actually using my GPUs. I see: H2O cluster total cores: 8 H2O cluster allowed cores: 8 which is the same as I had in the previous version (pre-GPU). Also, http://127.0.0.1:54321/flow/index.html shows only 8 cores as well. I wonder if I don't have something properly installed or

terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)9>'

Submitted by ε祈祈猫儿з on 2019-12-11 05:06:19
Question: I am a newbie in SYCL/OpenCL/GPGPU. I am trying to build and run sample code for a constant-addition program: #include <iostream> #include <array> #include <algorithm> #include <CL/sycl.hpp> namespace sycl = cl::sycl; //<<Define ConstantAdder>> template<typename T, typename Acc, size_t N> class ConstantAdder { public: ConstantAdder(Acc accessor, T val) : accessor(accessor), val(val) {} void operator() () { for (size_t i = 0; i < N; i++) { accessor[i] += val; } } private: Acc accessor; const T
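The excerpt cuts off mid-class. As a rough, hypothetical reconstruction of how such a functor is typically driven from the host in SYCL 1.2.1 (everything beyond the names in the excerpt is assumed), wrapping the submission in a try/catch also surfaces the message hidden behind the exception_implementation typeid seen in the title:

#include <array>
#include <iostream>
#include <CL/sycl.hpp>

namespace sycl = cl::sycl;

// ConstantAdder completed from the excerpt: adds a constant to N elements.
template <typename T, typename Acc, size_t N>
class ConstantAdder {
public:
    ConstantAdder(Acc accessor, T val) : accessor(accessor), val(val) {}
    void operator()() {
        for (size_t i = 0; i < N; i++) {
            accessor[i] += val;
        }
    }
private:
    Acc accessor;
    const T val;
};

int main() {
    std::array<int, 4> data{1, 2, 3, 4};
    try {
        sycl::queue queue{sycl::default_selector{}};
        {
            sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
            queue.submit([&](sycl::handler &cgh) {
                auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
                cgh.single_task(ConstantAdder<int, decltype(acc), 4>(acc, 5));
            });
        }   // buffer destructor waits and copies results back into data
        for (int v : data) { std::cout << v << " "; }
        std::cout << std::endl;
    } catch (const sycl::exception &e) {
        // Catching sycl::exception prints a readable message instead of the
        // raw typeid shown when terminate() is called on an unhandled throw.
        std::cerr << "SYCL exception: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}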