
Is global memory write considered atomic in CUDA?

Submitted by 柔情痞子 on 2019-12-20 01:58:07
Question: Are global memory writes considered atomic in CUDA? Consider the following CUDA kernel code:

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int gidx = idx % 1000;
    globalStorage[gidx] = somefunction(idx);

Is the global memory write to globalStorage atomic? That is, is it guaranteed that there are no race conditions in which concurrent kernel threads write to the bytes of the same variable stored in globalStorage and corrupt the results (e.g. partial writes)? Note that I am not talking about atomic…
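
What the question is driving at can be sketched as follows. A minimal illustration, assuming globalStorage is an int array and somefunction returns an int (neither type appears in the excerpt, and the body of somefunction below is a placeholder): on current hardware a naturally aligned word-sized store is never torn into partial writes, so each element always ends up holding some thread's complete value, but which thread's value survives is undefined unless an atomic is used.

    __device__ int somefunction(int idx) { return idx * 2; }   // placeholder body

    __global__ void plainWrite(int *globalStorage)
    {
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int gidx = idx % 1000;
        // Aligned 32-bit store: not torn (no partial writes), but when many
        // threads hit the same gidx, which value lands last is undefined.
        globalStorage[gidx] = somefunction(idx);
    }

    __global__ void exchangeWrite(int *globalStorage)
    {
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int gidx = idx % 1000;
        // atomicExch makes the whole-value guarantee explicit, and is the
        // starting point once read-modify-write semantics are needed.
        atomicExch(&globalStorage[gidx], somefunction(idx));
    }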

CUDA result returns garbage using very large array, but reports no error

Submitted by 爷,独闯天下 on 2019-12-19 21:16:52
Question: I am creating a test program that allocates a device array and a host array of size n, then launches a kernel with n threads, each of which assigns the constant value 0.95f to one location in the device array. After completion, the device array is copied to the host array, all entries are summed, and the final total is displayed. The program below seems to work fine for array sizes up to around 60 million floats and returns the correct results very quickly, but upon reaching 70 million the…
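
The excerpt cuts off before the launch configuration, but a classic cause of exactly this symptom is exceeding the grid-dimension limit: on pre-CC 3.0 devices a 1D grid is capped at 65,535 blocks, and 70 million threads at 512 per block needs more than that; without error checking the kernel silently never runs and the array stays uninitialized. A hedged sketch that sidesteps the limit with a grid-stride loop and checks both the launch and the run (names are mine, not the question's):

    #include <cstdio>

    // Grid-stride loop: correct for any n with a modest, always-legal grid,
    // instead of one thread per element.
    __global__ void fill(float *data, size_t n)
    {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] = 0.95f;
    }

    int main()
    {
        const size_t n = 70UL * 1000 * 1000;   // 70 million floats
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        fill<<<1024, 256>>>(d, n);
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
        printf("run:    %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

        cudaFree(d);
        return 0;
    }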

What device number should I use (0 or 1) to copy P2P (GPU0->GPU1)?

Submitted by 一笑奈何 on 2019-12-19 11:48:56
Question: Which device number, 0 or 1, must I set in cudaSetDevice() in order to copy P2P (GPU0 -> GPU1) using cudaStreamCreate(&stream); cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);? Code:

    // Set device 0 as current
    cudaSetDevice(0);
    float* p0;
    size_t size = 1024 * sizeof(float);
    // Allocate memory on device 0
    cudaMalloc(&p0, size);

    // Set device 1 as current
    cudaSetDevice(1);
    float* p1;
    // Allocate memory on device 1
    cudaMalloc(&p1, size);

    // Set device 0 as current
    cudaSetDevice(0);
    // Launch…
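
One point worth making explicit: cudaMemcpyPeerAsync() names both the destination and the source device in its arguments, so the direction of the copy does not depend on which device is current; what the current device influences is where streams are created and which device's peer access you enable. A minimal sketch under that reading (the stream is created on the source device; error handling trimmed for brevity):

    #include <cstdio>

    int main()
    {
        size_t size = 1024 * sizeof(float);
        float *p0 = nullptr, *p1 = nullptr;

        cudaSetDevice(0);
        cudaMalloc(&p0, size);                 // buffer on device 0
        cudaDeviceEnablePeerAccess(1, 0);      // device 0 may access device 1

        cudaSetDevice(1);
        cudaMalloc(&p1, size);                 // buffer on device 1
        cudaDeviceEnablePeerAccess(0, 0);      // device 1 may access device 0

        // Create the stream on the source device; the copy's direction is
        // fixed by the dst/src device arguments, not by the current device.
        cudaSetDevice(0);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);
        cudaStreamSynchronize(stream);

        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }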

CUDA Matrix multiplication breaks for large matrices

Submitted by 你离开我真会死。 on 2019-12-19 02:23:11
Question: I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The following code works fine with values of "Width" (matrix width) up to about 2500 or so.

    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;
    cudaError_t err = cudaSuccess;

    // Allocate device memory for M, N and P
    err = cudaMalloc((void**)&Md, size);
    err = cudaMalloc((void**)&Nd, size);
    err = cudaMalloc((void**)&Pd,…
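
The question's kernel is cut off, but the usual suspect for "works up to Width ~2500, then breaks" on a Windows display GPU is the WDDM watchdog (TDR), which kills any kernel running longer than about two seconds; a naive matmul on a GTX 480 crosses that threshold around such sizes. A hedged sketch (the kernel below is a generic naive matmul of my own, not the question's) showing the error checking that turns silent garbage into an explicit timeout error:

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                   \
        do {                                                              \
            cudaError_t e = (call);                                       \
            if (e != cudaSuccess) {                                       \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                        cudaGetErrorString(e));                           \
                exit(1);                                                  \
            }                                                             \
        } while (0)

    __global__ void MatMulKernel(const float *M, const float *N, float *P, int Width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < Width && col < Width) {
            float acc = 0.0f;
            for (int k = 0; k < Width; ++k)
                acc += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = acc;
        }
    }

    int main()
    {
        int Width = 4096;
        size_t size = (size_t)Width * Width * sizeof(float);  // size_t avoids int overflow
        float *Md, *Nd, *Pd;
        CHECK(cudaMalloc((void**)&Md, size));
        CHECK(cudaMalloc((void**)&Nd, size));
        CHECK(cudaMalloc((void**)&Pd, size));
        CHECK(cudaMemset(Md, 0, size));
        CHECK(cudaMemset(Nd, 0, size));

        dim3 block(16, 16);
        dim3 grid((Width + 15) / 16, (Width + 15) / 16);
        MatMulKernel<<<grid, block>>>(Md, Nd, Pd, Width);
        CHECK(cudaGetLastError());        // launch-configuration errors
        CHECK(cudaDeviceSynchronize());   // runtime errors, incl. watchdog timeout
        puts("ok");
        return 0;
    }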

How to interrupt or cancel a CUDA kernel from host code

Submitted by 北城余情 on 2019-12-19 02:03:21
Question: I am working with CUDA and I am trying to stop my kernel's work (i.e. terminate all running threads) after a certain if block is hit. How can I do that? I am really stuck here.

Answer 1: I assume you want to stop a running kernel (not a single thread). The simplest approach (and the one that I suggest) is to set up a global memory flag that is tested by the kernel. You can set the flag using cudaMemcpy() (or without it if using unified memory). Like the following:

    if (gm_flag) {…
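
The answer's snippet is cut off at the flag test. A minimal sketch of the same idea, with one substitution I should name plainly: instead of cudaMemcpy() it uses a mapped (zero-copy) host flag, so the host-side store becomes visible while the kernel is still running; the kernel name and launch shape are mine. The volatile qualifier keeps the compiler from caching the flag in a register.

    #include <cstdio>
    #include <chrono>
    #include <thread>

    __global__ void longRunning(volatile int *abortFlag)
    {
        while (!*abortFlag) {
            // ... do a chunk of work, then re-test the flag ...
        }
        // threads fall out of the loop and the kernel returns
    }

    int main()
    {
        // On older setups you may need cudaSetDeviceFlags(cudaDeviceMapHost)
        // before any other CUDA call for mapped allocations to work.
        int *h_flag = nullptr, *d_flag = nullptr;
        cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
        *h_flag = 0;
        cudaHostGetDevicePointer(&d_flag, h_flag, 0);

        longRunning<<<32, 128>>>(d_flag);

        std::this_thread::sleep_for(std::chrono::seconds(2));
        *h_flag = 1;                      // raise the flag from the host

        cudaDeviceSynchronize();          // returns once every thread exits
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFreeHost(h_flag);
        return 0;
    }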

Inter-block barrier on CUDA

Submitted by 故事扮演 on 2019-12-18 16:59:35
Question: I want to implement an inter-block barrier on CUDA, but I am encountering a serious problem. I cannot figure out why it does not work.

    #include <iostream>
    #include <cstdlib>
    #include <ctime>

    #define SIZE 10000000
    #define BLOCKS 100

    using namespace std;

    struct Barrier {
        int *count;

        __device__ void wait() {
            atomicSub(count, 1);
            while(*count)
                ;
        }

        Barrier() {
            int blocks = BLOCKS;
            cudaMalloc((void**) &count, sizeof(int));
            cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
        }

        ~Barrier() {…
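
Two failure modes are visible even in the truncated code: the plain while(*count) read is not volatile, so the compiler may cache the first load forever, and any spin barrier deadlocks outright if more blocks are launched than can be simultaneously resident on the GPU. On CUDA 9 and later the supported route is a cooperative launch with a grid-wide barrier; a hedged sketch (kernel and sizes are mine; compile with something like nvcc -arch=sm_70 -rdc=true):

    #include <cstdio>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void withGridBarrier(float *data, int n)
    {
        cg::grid_group grid = cg::this_grid();
        int stride = gridDim.x * blockDim.x;

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] = (float)i;                    // phase 1
        grid.sync();                               // barrier across ALL blocks
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] += data[(i + 1) % n];          // phase 2 reads phase-1 results
    }

    int main()
    {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        // Size the grid so every block is resident at once; this is exactly
        // the condition a hand-rolled spin barrier silently deadlocks on,
        // and the cooperative launch enforces it for you.
        int block = 256, perSM = 0;
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSM, withGridBarrier, block, 0);
        int gridSize = perSM * prop.multiProcessorCount;

        int nArg = n;
        void *args[] = { &d, &nArg };
        cudaError_t e = cudaLaunchCooperativeKernel((void*)withGridBarrier,
                                                    dim3(gridSize), dim3(block),
                                                    args, 0, 0);
        printf("launch: %s\n", cudaGetErrorString(e));
        printf("run:    %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
        cudaFree(d);
        return 0;
    }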

Equivalent of cudaGetErrorString for cuBLAS?

Submitted by 大憨熊 on 2019-12-18 16:38:34
Question: The CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error handling. I'm familiarizing myself with cuBLAS now, and I'd like to create a macro similar to CUDA_SAFE_CALL for cuBLAS. To make my macro's printouts useful, I'd like something analogous to cudaGetErrorString in cuBLAS. Is there an equivalent of…
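
Two notes, hedged: recent toolkits do ship a direct equivalent, cublasGetStatusString(cublasStatus_t) (added in the CUDA 11.x era, if memory serves), while on older toolkits the usual workaround is a hand-rolled switch over cublasStatus_t, e.g.:

    #include <cstdio>
    #include <cublas_v2.h>

    // Fallback for toolkits that predate cublasGetStatusString().
    static const char *cublasErrorString(cublasStatus_t s)
    {
        switch (s) {
            case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
            case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
            case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
            case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
            case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
            case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
            case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
            case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
            default:                             return "unknown cuBLAS status";
        }
    }

    #define CUBLAS_SAFE_CALL(call)                                        \
        do {                                                              \
            cublasStatus_t s_ = (call);                                   \
            if (s_ != CUBLAS_STATUS_SUCCESS)                              \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                        cublasErrorString(s_));                           \
        } while (0)

Typical use: CUBLAS_SAFE_CALL(cublasCreate(&handle));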

CUDA streams destruction and cudaDeviceReset

Submitted by 独自空忆成欢 on 2019-12-18 14:02:15
Question: I have implemented the following class using CUDA streams:

    class CudaStreams {
    private:
        int nStreams_;
        cudaStream_t* streams_;
        cudaStream_t active_stream_;

    public:
        // default constructor
        CudaStreams() { }

        // streams initialization
        void InitStreams(const int nStreams = 1) {
            nStreams_ = nStreams;
            // allocate and initialize an array of stream handles
            streams_ = (cudaStream_t*) malloc(nStreams_*sizeof(cudaStream_t));
            for(int i = 0; i < nStreams_; i++)
                CudaSafeCall(cudaStreamCreate(&(streams_[i])))…
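
The title suggests the pitfall: cudaDeviceReset() destroys the context, which invalidates every stream handle, so calling cudaStreamDestroy() afterwards fails. A hedged RAII sketch of the ordering rule (class name and shape are mine, not the question's):

    #include <cstdio>
    #include <vector>

    class StreamPool {
        std::vector<cudaStream_t> streams_;
    public:
        explicit StreamPool(int n = 1) : streams_(n) {
            for (auto &s : streams_) cudaStreamCreate(&s);
        }
        // Destroy streams while their context is still alive.
        ~StreamPool() {
            for (auto &s : streams_) cudaStreamDestroy(s);
        }
        cudaStream_t operator[](int i) const { return streams_[i]; }
    };

    int main()
    {
        {
            StreamPool pool(4);
            // ... launch work on pool[0] .. pool[3] ...
        }                        // destructor runs here, before any reset
        cudaDeviceReset();       // only now is tearing down the context safe
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }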

CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]

Submitted by 安稳与你 on 2019-12-18 12:05:27
Question (marked as a duplicate of "CUDA: How many concurrent threads in total?"): We have a workstation with two Nvidia Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum number of threads per multiprocessor (SM) is 1024, while the maximum number of threads per block is 512. Given that only one block can be executed on each SM at a time, why is max threads per multiprocessor double the max threads per block? How do we utilise the other 512…
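
The premise is where the answer hides: an SM can host more than one resident block at a time, so two 512-thread blocks together fill a 1024-thread multiprocessor. A small sketch that reads both limits off the device and computes the ratio:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // The per-SM limit exceeds the per-block limit precisely because
        // several blocks can be resident on one SM simultaneously.
        printf("max threads / block: %d\n", prop.maxThreadsPerBlock);
        printf("max threads / SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("blocks of %d to fill one SM: %d\n",
               prop.maxThreadsPerBlock,
               prop.maxThreadsPerMultiProcessor / prop.maxThreadsPerBlock);
        return 0;
    }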

CUDA atomic operation performance in different scenarios

Submitted by ▼魔方 西西 on 2019-12-18 11:12:28
Question: When I came across this question on SO, I was curious to know the answer, so I wrote the piece of code below to test atomic operation performance in different scenarios. The OS is Ubuntu 12.04 with CUDA 5.5 and the device is a GeForce GTX 780 (Kepler architecture). I compiled the code with the -O3 flag and for CC=3.5.

    #include <stdio.h>

    static void HandleError( cudaError_t err, const char *file, int line )
    {
        if (err != cudaSuccess) {
            printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line…
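
The excerpt stops inside the error handler, before the benchmark itself. A hedged sketch of the two extreme scenarios such a test typically compares (kernel names and sizes are mine): every thread hammering one word, which the hardware must serialize, versus each thread updating its own word, which runs at close to plain memory-write speed.

    #include <cstdio>

    __global__ void sameAddress(int *out)          // maximum contention
    {
        atomicAdd(out, 1);
    }

    __global__ void distinctAddresses(int *out)    // zero contention
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&out[i], 1);
    }

    int main()
    {
        const int blocks = 1024, threads = 256, n = blocks * threads;
        int *buf;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMemset(buf, 0, n * sizeof(int));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        float ms;

        cudaEventRecord(t0);
        sameAddress<<<blocks, threads>>>(buf);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("same address:       %.3f ms\n", ms);

        cudaEventRecord(t0);
        distinctAddresses<<<blocks, threads>>>(buf);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("distinct addresses: %.3f ms\n", ms);

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        cudaFree(buf);
        return 0;
    }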