gpgpu

Making some, but not all, (CUDA) memory accesses uncached

Submitted by 痴心易碎 on 2020-03-22 08:21:19
Question: I just noticed it is possible at all to have (CUDA kernel) memory accesses uncached (see e.g. this answer here on SO). Can this be done... For a single kernel individually? At run time rather than at compile time? For writes only rather than for reads and writes? Answer 1: Only if you compile that kernel individually, because this is an instruction-level feature which is enabled by code generation. You could also use inline PTX assembler to issue ld.global.cg instructions for a particular load
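The answer points at two routes. Below is a minimal sketch of both, assuming a recent CUDA toolchain; the helper name load_uncached is ours, not a CUDA API. Whole-kernel: compile just that kernel's translation unit with nvcc -Xptxas -dlcm=cg so global loads bypass L1 and are cached in L2 only. Per-load: inline PTX for a single load (newer toolkits also expose an __ldcg() intrinsic with the same effect).

    // Single L2-only ("uncached" in L1) load via inline PTX, per the answer above.
    __device__ float load_uncached(const float* ptr) {
        float v;
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(ptr));
        return v;
    }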

How to get the CUDA version?

Submitted by 送分小仙女□ on 2020-02-28 07:55:48
Question: Is there any quick command or script to check the installed CUDA version? I found the 4.0 manual in the installation directory, but I'm not sure whether that is the version actually installed. Answer 1: After installing CUDA, you can check the version with: nvcc -V. I have 5.0 and 5.5 installed, so it gives: Cuda compilation tools, release 5.5, V5.5.0. This command works on both Windows and Ubuntu. Answer 2: Besides the ones mentioned above, your CUDA installation path (if not changed during installation) typically contains the version number. Running which nvcc should give the path, and the path gives you the version. PS: this is a quick-and-dirty way; the answers above are more elegant and took considerably more effort. Answer 3: You may find CUDA-Z useful; here is a quote from their website: "This program was born as a parody of other Z-utilities such as CPU-Z and GPU-Z. CUDA-Z shows some basic information about CUDA-enabled GPUs and GPGPUs. It works with nVIDIA Geforce, Quadro and Tesla cards, and ION chipsets." http://cuda-z.sourceforge.net/ The Support tab has a URL for the source code: http://sourceforge.net/p/cuda-z/code/, and the download is actually not an installer but the executable itself (no installation, hence the "quick"). The utility gives a lot of information, and if you need to know how it was derived you can look at the source. You can search for other utilities similar to this one.
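If you also want the version from inside a program rather than from the shell, a small sketch using two standard runtime-API calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int driverVersion = 0, runtimeVersion = 0;
        cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtimeVersion);  // version of the CUDA runtime linked into this program
        // Versions are encoded as 1000*major + 10*minor, e.g. 5050 for CUDA 5.5.
        printf("Driver supports CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
        printf("Runtime is CUDA %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
        return 0;
    }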

Get statistics for a list of numbers using GPU

Submitted by 心不动则不痛 on 2020-02-04 09:25:06
Question: I have several lists of numbers in a file. For example: .333, .324, .123, .543, .00054, .2243, .333, .53343, .4434. Now I want to get the number of times each number occurs using the GPU. I believe this will be faster on the GPU than on the CPU because each thread can process one list. What data structure should I use on the GPU to easily get the above counts? For example, for the above, the answer would look as follows: .333 = 2 times in entire file, .324 = 1 time, etc. I am looking for a
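The excerpt is cut off, but one standard pattern for exact-match counting on the GPU is sort-then-count-runs. A hedged sketch using Thrust (which ships with the CUDA toolkit); it compares floats for exact equality, as the question implies:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <cstdio>

    int main() {
        float host[] = {0.333f, 0.324f, 0.123f, 0.543f, 0.00054f,
                        0.2243f, 0.333f, 0.53343f, 0.4434f};
        thrust::device_vector<float> vals(host, host + 9);

        thrust::sort(vals.begin(), vals.end());          // bring duplicates together

        thrust::device_vector<float> keys(vals.size());  // unique values
        thrust::device_vector<int> counts(vals.size());  // how often each occurs
        auto end = thrust::reduce_by_key(
            vals.begin(), vals.end(),
            thrust::constant_iterator<int>(1),           // each element counts as 1
            keys.begin(), counts.begin());

        int n = end.first - keys.begin();
        for (int i = 0; i < n; ++i)
            printf("%g = %d times\n", (float)keys[i], (int)counts[i]);
        return 0;
    }

For the sample above this prints .333 = 2 times and every other value = 1 time.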

PyCUDA/CUDA: Causes of non-deterministic launch failures?

Submitted by 跟風遠走 on 2020-02-03 08:50:32
Question: Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't, I'll summarize. (Apologies in advance for the long question.) Three kernels: one generates a data set based on some input variables (it deals with bit combinations, so it can grow exponentially), another solves the generated linear systems, and a third reduction kernel gets the final result out. These three kernels are run over and over again as part of an optimisation
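The excerpt is truncated, but for sporadic launch failures the usual first step is rigorous error checking around every call and launch; otherwise an error from one kernel surfaces at a later, seemingly random point. A generic sketch (the macro and the kernel name in the usage comment are our own, not from the question):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // Usage after each of the three kernels:
    //   generateKernel<<<grid, block>>>(...);   // hypothetical launch
    //   CUDA_CHECK(cudaGetLastError());         // catches launch-time errors
    //   CUDA_CHECK(cudaDeviceSynchronize());    // catches asynchronous execution errors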

Improving memory layout for parallel computing

Submitted by 断了今生、忘了曾经 on 2020-01-29 09:45:08
Question: I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP, and I'm looking for some suggestions to optimize the memory layout. I just found out that moving one parameter out of the structure into a separate vector (the blocked vector) gave an increase of about 10%. Does anyone have tips that could improve this further, or something I should take into consideration? Below is the most time-consuming function, which is executed for each timestep, and the structure used for
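The excerpt stops before the code, but the reported ~10% gain from moving a field into its own vector is the classic array-of-structures versus structure-of-arrays effect. A hedged illustration in plain C++ (the field names are invented, not the asker's):

    // AoS: thread i reads nodes[i].rho, so consecutive threads load
    // addresses strided by sizeof(Node) and waste memory bandwidth.
    struct Node { float rho; float ux; float uy; };
    // Node nodes[N];

    // SoA: thread i reads rho[i]; consecutive threads touch consecutive
    // floats, which coalesces on GPUs and vectorises on CPUs.
    struct Lattice {
        float* rho;  // N densities, contiguous
        float* ux;   // N x-velocities, contiguous
        float* uy;   // N y-velocities, contiguous
    };

Moving further rarely-used fields out of the hot structure tends to compound the gain, since every byte of a cache line the kernel does not use is bandwidth thrown away.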

CUDA FFT exception

Submitted by 最后都变了- on 2020-01-26 03:15:10
Question: I'm trying to use CUDA FFT, aka the cufft library. The problem occurs when cufftPlan1d(..) throws an exception.

    #define NX 256
    #define BATCH 10

    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess) {
        fprintf(stderr, "Cuda error: Failed to allocate\n");
        return;
    }
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "CUFFT error: Plan creation failed");
        return;
    }

When the compiler hit the
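For completeness, a hedged sketch of what typically follows a successful plan creation; this is the standard cufft call sequence, not code from the question (link with -lcufft):

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward transform
    cudaDeviceSynchronize();                        // wait for the transform to finish
    cufftDestroy(plan);                             // release plan resources
    cudaFree(data);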

OpenMP 4.0 for accelerators: Nvidia GPU target

Submitted by ◇◆丶佛笑我妖孽 on 2020-01-25 18:08:26
Question: I'm trying to use OpenMP for accelerators (OpenMP 4.0) in Visual Studio 2012, using the Intel C++ 15.0 compiler. My accelerator is an Nvidia GeForce GTX 670. This code does not compile:

    #include <stdio.h>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            cout << "Hello world, i am number " << i << endl;
    }

Of course, everything goes fine when I comment out the #pragma omp target line. I get the same problem when
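The excerpt is truncated before the answer. One plausible obstacle, offered as an assumption rather than the accepted explanation: std::cout is generally not usable inside an offloaded target region, whereas printf usually is; note also that a compiler must actually support offload to the given device, and many OpenMP 4.0 implementations of that era targeted only specific accelerators. A minimal variant that avoids iostream in the device region:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            printf("Hello world, i am number %d\n", i);  // printf is typically supported in device code
        return 0;
    }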