gpgpu

OpenMP 4.0 for accelerators: Nvidia GPU target

自闭症网瘾萝莉.ら submitted on 2020-01-25 18:06:06
Question: I'm trying to use OpenMP for accelerators (OpenMP 4.0) in Visual Studio 2012, using the Intel C++ 15.0 compiler. My accelerator is an Nvidia GeForce GTX 670. This code does not compile:

    #include <stdio.h>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            cout << "Hello world, i am number " << i << endl;
    }

Of course, everything goes fine when I comment out the #pragma omp target line. I get the same problem when …
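For reference, a minimal offload sketch (our own example, not from the question), assuming a toolchain with working OpenMP 4.x GPU offload; printf is the portable choice inside a target region, since iostream is generally not usable on the device:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        // Combined construct from OpenMP 4.0: offload the loop to the
        // default device and spread iterations over teams and threads.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < 1000; i++)
            printf("Hello world, I am number %d\n", i);
        return 0;
    }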

Zero Copy Buffers using cl_arm_import_memory extension in OpenCL 1.2 - arm mali midgard GPUs

女生的网名这么多〃 submitted on 2020-01-25 02:48:52
Question: I wish to allocate a vector and use its data pointer to allocate a zero-copy buffer on the GPU. There is a cl_arm_import_memory extension which can be used to do this, but I am not sure whether it is supported by all Mali Midgard OpenCL drivers. I was going through this link and I am quite puzzled by the following lines:

- If the extension string cl_arm_import_memory_host is exposed then importing from normal userspace allocations (such as those created via malloc) is supported.
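A minimal host-side sketch, assuming the driver reports both cl_arm_import_memory and cl_arm_import_memory_host among its extensions (error handling abbreviated; the vector must stay alive, and must not be resized, while the imported buffer exists):

    #include <CL/cl.h>
    #include <CL/cl_ext.h>   // clImportMemoryARM, CL_IMPORT_TYPE_*_ARM
    #include <vector>

    cl_mem import_host_vector(cl_context ctx, std::vector<float> &v, cl_int *err) {
        // Property list: we are importing a plain host (malloc/new) allocation.
        const cl_import_properties_arm props[] = {
            CL_IMPORT_TYPE_ARM, CL_IMPORT_TYPE_HOST_ARM,
            0
        };
        // The returned buffer aliases v.data() directly: zero copy.
        return clImportMemoryARM(ctx, CL_MEM_READ_WRITE, props,
                                 v.data(), v.size() * sizeof(float), err);
    }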

Cuda optimization techniques

二次信任 submitted on 2020-01-24 22:14:28
Question: I have written CUDA code to solve an NP-complete problem, but the performance was not what I expected. I know about "some" optimization techniques (using shared memory, textures, zero-copy...). What are the most important optimization techniques CUDA programmers should know about?

Answer 1: You should read NVIDIA's CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf This has multiple different performance tips …
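To illustrate what is arguably the guide's most important tip, coalesced global-memory access, here is a sketch of our own (not from the answer): adjacent threads should touch adjacent addresses, so one warp's loads collapse into few transactions:

    // Coalesced: thread i touches element i, so a warp of 32 threads
    // reads one contiguous segment of memory.
    __global__ void scale_coalesced(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // Anti-pattern: neighbouring threads are `stride` elements apart,
    // scattering a warp's accesses across many memory segments.
    __global__ void scale_strided(float *x, float a, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) x[i] *= a;
    }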

What do I need for programming for Tegra GPU

元气小坏坏 submitted on 2020-01-22 17:53:46
Question: Can I develop CUDA applications for the Tegra 1/2 processors? What do I need for this, and what CUDA capability do Tegra 1/2 have? I found only the NVIDIA Debug Manager for development in Eclipse for Android, but I do not know whether it supports CUDA-style development.

Answer 1: Current Tegra processors (Tegra 1, 2 and 3) do not support the CUDA platform. To learn about Tegra development and download the Tegra Android Development Kit, see the NVIDIA developer zone for mobile.

Answer 2: See similar question/answers here: CUDA …

Access/synchronization to local memory

☆樱花仙子☆ submitted on 2020-01-17 06:22:29
Question: I'm pretty new to GPGPU programming. I'm trying to implement an algorithm that needs a lot of synchronization, so it uses only one work-group (global and local size have the same value). I have the following problem: my program works correctly until the problem size exceeds 32.

    __kernel void assort(
        __global float *array,
        __local float *currentOutput,
        __local float *stimulations,
        __local int *noOfValuesAdded,
        __local float *addedValue,
        __local float *positionToInsert,
        __local int *activatedIdx,
        _ …
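"Works until the size exceeds 32" is the classic symptom of relying on the implicit lock-step of a single 32-wide warp/wavefront instead of explicit barriers. A minimal sketch of the required pattern, written in CUDA syntax for brevity; in an OpenCL kernel, barrier(CLK_LOCAL_MEM_FENCE) plays the role of __syncthreads():

    // Block-level max reduction staged in shared (OpenCL: local) memory.
    // The barriers are required on EVERY pass, even once stride < 32 --
    // omitting them happens to work inside one warp and silently breaks
    // as soon as more than 32 work-items participate.
    __global__ void local_max(const float *in, float *out, int n) {
        __shared__ float s[256];                 // assumes blockDim.x == 256
        int t = threadIdx.x;
        s[t] = (t < n) ? in[t] : -3.4e38f;
        __syncthreads();                         // writes visible to all threads
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride) s[t] = fmaxf(s[t], s[t + stride]);
            __syncthreads();                     // sync before the next pass
        }
        if (t == 0) out[0] = s[0];
    }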

CUDA block synchronization differences between GTS 250 and Fermi devices

老子叫甜甜 submitted on 2020-01-17 05:48:15
Question: So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS 250, which is a compute capability 1.1 device. However, on a compute capability 2.0 device (C2050 or C2070) the hash table is corrupted (data is incorrect and pointers are sometimes wrong). Basically the code works fine when only one block is used (on both devices). However, when 2 or more blocks are used, it works only on the GTS 250 and not on any Fermi device. I understand …
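Cross-block updates to a shared structure need atomic operations, and non-atomic stores need __threadfence() before being published; code with such races can happen to pass on compute 1.1 hardware and fail on Fermi's cached memory hierarchy. A hypothetical sketch with invented names (assuming 64-bit pointers, so atomicCAS on unsigned long long can swing the bucket head):

    struct Node { int key; int value; Node *next; };

    // Lock-free push of a pre-filled node onto a bucket's head pointer.
    // Without the CAS, two blocks can both read the old head and one
    // insertion is silently lost -- much like the corruption described.
    __device__ void bucket_push(Node **head, Node *node) {
        Node *old = *head;
        for (;;) {
            node->next = old;
            __threadfence();  // publish node contents before linking it in
            Node *seen = (Node *)atomicCAS((unsigned long long *)head,
                                           (unsigned long long)old,
                                           (unsigned long long)node);
            if (seen == old) break;   // success: node is the new head
            old = seen;               // lost the race; retry against new head
        }
    }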

GPU card resets after 2 seconds

匆匆过客 submitted on 2020-01-16 00:50:17
Question: I'm using an NVIDIA GeForce card that gives an error after 2 seconds if I try to run some CUDA program on it. I read here that you can use the TdrLevel key in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers. However, I don't see any such key in the registry. Does it need to be added manually? Has anybody else experienced this problem? If so, how did you solve it? Thanks.

Answer 1: I'm assuming you are using Windows Vista or later. The article you linked to contains a list …
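The TDR values do not exist by default; they must be created under that key (and take effect after a reboot). A hypothetical sketch using the Win32 registry API, though in practice regedit is the usual route; the 10-second value is illustrative only:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        HKEY key;
        DWORD delaySeconds = 10;  // illustrative; raises the watchdog timeout
        // Open (or create) the GraphicsDrivers key; requires Administrator.
        LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                "System\\CurrentControlSet\\Control\\GraphicsDrivers",
                0, NULL, 0, KEY_SET_VALUE, NULL, &key, NULL);
        if (rc != ERROR_SUCCESS) { fprintf(stderr, "open failed: %ld\n", rc); return 1; }
        // Create/set the REG_DWORD value TdrDelay (seconds before a reset).
        rc = RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                            (const BYTE *)&delaySeconds, sizeof(delaySeconds));
        RegCloseKey(key);
        return rc == ERROR_SUCCESS ? 0 : 1;
    }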

What is the difference between cudaMemcpy() and cudaMemcpyPeer() for P2P-copy?

ε祈祈猫儿з submitted on 2020-01-13 04:55:06
Question: I want to copy data directly from GPU0's memory to GPU1's memory, without going through CPU RAM. As it says on page 15 here: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf

Peer-to-Peer Memcpy
- Direct copy from pointer on GPU A to pointer on GPU B
- With UVA, just use cudaMemcpy(…, cudaMemcpyDefault), or cudaMemcpyAsync(…, cudaMemcpyDefault)
- Also non-UVA explicit P2P copies:
  - cudaError_t cudaMemcpyPeer(void *dst, int dstDevice, const void *src, int srcDevice, size_t count)
  - cudaError_t …
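A minimal two-GPU sketch, assuming the devices can reach each other (checked via cudaDeviceCanAccessPeer); buffer names and the 1 MiB size are our own:

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *src0, *dst1;
        int canAccess = 0;

        cudaSetDevice(0); cudaMalloc(&src0, bytes);   // buffer on GPU 0
        cudaSetDevice(1); cudaMalloc(&dst1, bytes);   // buffer on GPU 1

        // Optional fast path: let device 1 map device 0's memory so the
        // copy travels over PCIe/NVLink instead of staging through host RAM.
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);
        if (canAccess) cudaDeviceEnablePeerAccess(0, 0);  // current device is 1

        // Explicit P2P copy; works with or without UVA.
        cudaMemcpyPeer(dst1, 1, src0, 0, bytes);

        // With UVA (64-bit platforms), the generic form is equivalent:
        // cudaMemcpy(dst1, src0, bytes, cudaMemcpyDefault);

        cudaSetDevice(0); cudaFree(src0);
        cudaSetDevice(1); cudaFree(dst1);
        return 0;
    }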