gpgpu

OpenMP 4.0 for accelerators: Nvidia GPU target

自闭症网瘾萝莉.ら submitted on 2020-01-25 18:06:06
Question: I'm trying to use OpenMP for accelerators (OpenMP 4.0) in Visual Studio 2012, using the Intel C++ 15.0 compiler. My accelerator is an Nvidia GeForce GTX 670. This code does not compile:

    #include <stdio.h>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            cout << "Hello world, i am number " << i << endl;
    }

Of course, everything goes fine when I comment out the #pragma omp target line. I get the same problem when …
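For reference, a minimal offload sketch (our own example, not from the question), assuming a toolchain with working OpenMP 4.x GPU offload; printf is the portable choice inside a target region, since iostream is generally not usable on the device:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        // Combined construct from OpenMP 4.0: offload the loop to the
        // default device and spread iterations over teams and threads.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < 1000; i++)
            printf("Hello world, I am number %d\n", i);
        return 0;
    }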

Zero Copy Buffers using cl_arm_import_memory extension in OpenCL 1.2 - arm mali midgard GPUs

女生的网名这么多〃 submitted on 2020-01-25 02:48:52
Question: I wish to allocate a vector and use its data pointer to allocate a zero-copy buffer on the GPU. There is a cl_arm_import_memory extension which can be used to do this, but I am not sure whether it is supported by all Mali Midgard OpenCL drivers. I was going through this link and I am quite puzzled by the following lines:

- If the extension string cl_arm_import_memory_host is exposed then importing from normal userspace allocations (such as those created via malloc) is supported.
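A minimal host-side sketch, assuming the driver reports both cl_arm_import_memory and cl_arm_import_memory_host among its extensions (error handling abbreviated; the vector must stay alive, and must not be resized, while the imported buffer exists):

    #include <CL/cl.h>
    #include <CL/cl_ext.h>   // clImportMemoryARM, CL_IMPORT_TYPE_*_ARM
    #include <vector>

    cl_mem import_host_vector(cl_context ctx, std::vector<float> &v, cl_int *err) {
        // Property list: we are importing a plain host (malloc/new) allocation.
        const cl_import_properties_arm props[] = {
            CL_IMPORT_TYPE_ARM, CL_IMPORT_TYPE_HOST_ARM,
            0
        };
        // The returned buffer aliases v.data() directly: zero copy.
        return clImportMemoryARM(ctx, CL_MEM_READ_WRITE, props,
                                 v.data(), v.size() * sizeof(float), err);
    }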

Cuda optimization techniques

二次信任 submitted on 2020-01-24 22:14:28
Question: I have written CUDA code to solve an NP-complete problem, but the performance was not what I expected. I know about "some" optimization techniques (using shared memory, textures, zero-copy...). What are the most important optimization techniques CUDA programmers should know about?

Answer 1: You should read NVIDIA's CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf This has multiple different performance tips …
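To illustrate what is arguably the guide's most important tip, coalesced global-memory access, here is a sketch of our own (not from the answer): adjacent threads should touch adjacent addresses, so one warp's loads collapse into few transactions:

    // Coalesced: thread i touches element i, so a warp of 32 threads
    // reads one contiguous segment of memory.
    __global__ void scale_coalesced(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // Anti-pattern: neighbouring threads are `stride` elements apart,
    // scattering a warp's accesses across many memory segments.
    __global__ void scale_strided(float *x, float a, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) x[i] *= a;
    }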

What do I need for programming for Tegra GPU

元气小坏坏 submitted on 2020-01-22 17:53:46
Question: Can I develop CUDA applications for the Tegra 1/2 processors? What do I need for this, and what CUDA capability do Tegra 1/2 have? I found only the NVIDIA Debug Manager for development in Eclipse for Android, but I do not know whether it supports CUDA-style development.

Answer 1: Current Tegra processors (Tegra 1, 2 and 3) do not support the CUDA platform. To learn about Tegra development and download the Tegra Android Development Kit, see the NVIDIA developer zone for mobile.

Answer 2: See similar question/answers here: CUDA …

Access/synchronization to local memory

☆樱花仙子☆ submitted on 2020-01-17 06:22:29
Question: I'm pretty new to GPGPU programming. I'm trying to implement an algorithm that needs a lot of synchronization, so it uses only one work-group (global and local size have the same value). I have the following problem: my program works correctly until the problem size exceeds 32.

    __kernel void assort(
        __global float *array,
        __local float *currentOutput,
        __local float *stimulations,
        __local int *noOfValuesAdded,
        __local float *addedValue,
        __local float *positionToInsert,
        __local int *activatedIdx,
        _ …
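"Works until the size exceeds 32" is the classic symptom of relying on the implicit lock-step of a single 32-wide warp/wavefront instead of explicit barriers. A minimal sketch of the required pattern, written in CUDA syntax for brevity; in an OpenCL kernel, barrier(CLK_LOCAL_MEM_FENCE) plays the role of __syncthreads():

    // Block-level max reduction staged in shared (OpenCL: local) memory.
    // The barriers are required on EVERY pass, even once stride < 32 --
    // omitting them happens to work inside one warp and silently breaks
    // as soon as more than 32 work-items participate.
    __global__ void local_max(const float *in, float *out, int n) {
        __shared__ float s[256];                 // assumes blockDim.x == 256
        int t = threadIdx.x;
        s[t] = (t < n) ? in[t] : -3.4e38f;
        __syncthreads();                         // writes visible to all threads
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride) s[t] = fmaxf(s[t], s[t + stride]);
            __syncthreads();                     // sync before the next pass
        }
        if (t == 0) out[0] = s[0];
    }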

CUDA block synchronization differences between GTS 250 and Fermi devices

老子叫甜甜 submitted on 2020-01-17 05:48:15
Question: So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS 250, which is a compute capability 1.1 device. However, on a compute capability 2.0 device (C2050 or C2070) the hash table is corrupted (data is incorrect and pointers are sometimes wrong). Basically the code works fine when only one block is used (on both devices). However, when 2 or more blocks are used, it works only on the GTS 250 and not on any Fermi device. I understand …
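Cross-block updates to a shared structure need atomic operations, and non-atomic stores need __threadfence() before being published; code with such races can happen to pass on compute 1.1 hardware and fail on Fermi's cached memory hierarchy. A hypothetical sketch with invented names (assuming 64-bit pointers, so atomicCAS on unsigned long long can swing the bucket head):

    struct Node { int key; int value; Node *next; };

    // Lock-free push of a pre-filled node onto a bucket's head pointer.
    // Without the CAS, two blocks can both read the old head and one
    // insertion is silently lost -- much like the corruption described.
    __device__ void bucket_push(Node **head, Node *node) {
        Node *old = *head;
        for (;;) {
            node->next = old;
            __threadfence();  // publish node contents before linking it in
            Node *seen = (Node *)atomicCAS((unsigned long long *)head,
                                           (unsigned long long)old,
                                           (unsigned long long)node);
            if (seen == old) break;   // success: node is the new head
            old = seen;               // lost the race; retry against new head
        }
    }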

GPU card resets after 2 seconds

匆匆过客 submitted on 2020-01-16 00:50:17
Question: I'm using an NVIDIA GeForce card that gives an error after 2 seconds if I try to run some CUDA program on it. I read here that you can use the TdrLevel key in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers. However, I don't see any such key in the registry. Does it need to be added manually? Has anybody else experienced this problem? If so, how did you solve it? Thanks.

Answer 1: I'm assuming you are using Windows Vista or later. The article you linked to contains a list …
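The TDR values do not exist by default; they must be created under that key (and take effect after a reboot). A hypothetical sketch using the Win32 registry API, though in practice regedit is the usual route; the 10-second value is illustrative only:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        HKEY key;
        DWORD delaySeconds = 10;  // illustrative; raises the watchdog timeout
        // Open (or create) the GraphicsDrivers key; requires Administrator.
        LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                "System\\CurrentControlSet\\Control\\GraphicsDrivers",
                0, NULL, 0, KEY_SET_VALUE, NULL, &key, NULL);
        if (rc != ERROR_SUCCESS) { fprintf(stderr, "open failed: %ld\n", rc); return 1; }
        // Create/set the REG_DWORD value TdrDelay (seconds before a reset).
        rc = RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                            (const BYTE *)&delaySeconds, sizeof(delaySeconds));
        RegCloseKey(key);
        return rc == ERROR_SUCCESS ? 0 : 1;
    }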

What is the difference between cudaMemcpy() and cudaMemcpyPeer() for P2P-copy?

ε祈祈猫儿з submitted on 2020-01-13 04:55:06
Question: I want to copy data directly from GPU0's memory to GPU1's memory, without going through CPU RAM. As it says on page 15 here: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf

Peer-to-Peer Memcpy
- Direct copy from pointer on GPU A to pointer on GPU B
- With UVA, just use cudaMemcpy(…, cudaMemcpyDefault), or cudaMemcpyAsync(…, cudaMemcpyDefault)
- Also non-UVA explicit P2P copies:
  - cudaError_t cudaMemcpyPeer(void *dst, int dstDevice, const void *src, int srcDevice, size_t count)
  - cudaError_t …
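A minimal two-GPU sketch, assuming the devices can reach each other (checked via cudaDeviceCanAccessPeer); buffer names and the 1 MiB size are our own:

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *src0, *dst1;
        int canAccess = 0;

        cudaSetDevice(0); cudaMalloc(&src0, bytes);   // buffer on GPU 0
        cudaSetDevice(1); cudaMalloc(&dst1, bytes);   // buffer on GPU 1

        // Optional fast path: let device 1 map device 0's memory so the
        // copy travels over PCIe/NVLink instead of staging through host RAM.
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);
        if (canAccess) cudaDeviceEnablePeerAccess(0, 0);  // current device is 1

        // Explicit P2P copy; works with or without UVA.
        cudaMemcpyPeer(dst1, 1, src0, 0, bytes);

        // With UVA (64-bit platforms), the generic form is equivalent:
        // cudaMemcpy(dst1, src0, bytes, cudaMemcpyDefault);

        cudaSetDevice(0); cudaFree(src0);
        cudaSetDevice(1); cudaFree(dst1);
        return 0;
    }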