
Is global memory write considered atomic in CUDA?

Submitted by 柔情痞子 on 2019-12-20 01:58:07
Question: Are global memory writes considered atomic in CUDA? Consider the following CUDA kernel code:

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int gidx = idx % 1000;
    globalStorage[gidx] = somefunction(idx);

Is the global memory write to globalStorage atomic? That is, is it guaranteed that there are no race conditions in which concurrent kernel threads write to the bytes of the same variable stored in globalStorage and corrupt the results (e.g. partial writes)? Note that I am not talking about atomic…
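
What the question is driving at can be sketched as follows. A minimal illustration, assuming globalStorage is an int array and somefunction returns an int (neither type appears in the excerpt, and the body of somefunction below is a placeholder): on current hardware a naturally aligned word-sized store is never torn into partial writes, so each element always ends up holding some thread's complete value, but which thread's value survives is undefined unless an atomic is used.

    __device__ int somefunction(int idx) { return idx * 2; }   // placeholder body

    __global__ void plainWrite(int *globalStorage)
    {
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int gidx = idx % 1000;
        // Aligned 32-bit store: not torn (no partial writes), but when many
        // threads hit the same gidx, which value lands last is undefined.
        globalStorage[gidx] = somefunction(idx);
    }

    __global__ void exchangeWrite(int *globalStorage)
    {
        int idx  = blockIdx.x * blockDim.x + threadIdx.x;
        int gidx = idx % 1000;
        // atomicExch makes the whole-value guarantee explicit, and is the
        // starting point once read-modify-write semantics are needed.
        atomicExch(&globalStorage[gidx], somefunction(idx));
    }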

CUDA result returns garbage using very large array, but reports no error

Submitted by 爷,独闯天下 on 2019-12-19 21:16:52
Question: I am creating a test program that allocates a device array and a host array of size n, then launches a kernel with n threads, each of which assigns the constant value 0.95f to one location in the device array. After completion, the device array is copied to the host array, all entries are summed, and the final total is displayed. The program below seems to work fine for array sizes up to around 60 million floats and returns the correct results very quickly, but upon reaching 70 million the…
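
The excerpt cuts off before the launch configuration, but a classic cause of exactly this symptom is exceeding the grid-dimension limit: on pre-CC 3.0 devices a 1D grid is capped at 65,535 blocks, and 70 million threads at 512 per block needs more than that; without error checking the kernel silently never runs and the array stays uninitialized. A hedged sketch that sidesteps the limit with a grid-stride loop and checks both the launch and the run (names are mine, not the question's):

    #include <cstdio>

    // Grid-stride loop: correct for any n with a modest, always-legal grid,
    // instead of one thread per element.
    __global__ void fill(float *data, size_t n)
    {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] = 0.95f;
    }

    int main()
    {
        const size_t n = 70UL * 1000 * 1000;   // 70 million floats
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        fill<<<1024, 256>>>(d, n);
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
        printf("run:    %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

        cudaFree(d);
        return 0;
    }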

What device number should I use (0 or 1) to copy P2P (GPU0->GPU1)?

Submitted by 一笑奈何 on 2019-12-19 11:48:56
Question: Which device number, 0 or 1, must I set in cudaSetDevice() in order to copy P2P (GPU0 -> GPU1) using cudaStreamCreate(&stream); cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);? Code:

    // Set device 0 as current
    cudaSetDevice(0);
    float* p0;
    size_t size = 1024 * sizeof(float);
    // Allocate memory on device 0
    cudaMalloc(&p0, size);

    // Set device 1 as current
    cudaSetDevice(1);
    float* p1;
    // Allocate memory on device 1
    cudaMalloc(&p1, size);

    // Set device 0 as current
    cudaSetDevice(0);
    // Launch…
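
One point worth making explicit: cudaMemcpyPeerAsync() names both the destination and the source device in its arguments, so the direction of the copy does not depend on which device is current; what the current device influences is where streams are created and which device's peer access you enable. A minimal sketch under that reading (the stream is created on the source device; error handling trimmed for brevity):

    #include <cstdio>

    int main()
    {
        size_t size = 1024 * sizeof(float);
        float *p0 = nullptr, *p1 = nullptr;

        cudaSetDevice(0);
        cudaMalloc(&p0, size);                 // buffer on device 0
        cudaDeviceEnablePeerAccess(1, 0);      // device 0 may access device 1

        cudaSetDevice(1);
        cudaMalloc(&p1, size);                 // buffer on device 1
        cudaDeviceEnablePeerAccess(0, 0);      // device 1 may access device 0

        // Create the stream on the source device; the copy's direction is
        // fixed by the dst/src device arguments, not by the current device.
        cudaSetDevice(0);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);
        cudaStreamSynchronize(stream);

        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }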

CUDA Matrix multiplication breaks for large matrices

Submitted by 你离开我真会死。 on 2019-12-19 02:23:11
Question: I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The following code works fine with values of "Width" (matrix width) up to about 2500 or so.

    int size = Width*Width*sizeof(float);
    float *Md, *Nd, *Pd;
    cudaError_t err = cudaSuccess;

    // Allocate device memory for M, N and P
    err = cudaMalloc((void**)&Md, size);
    err = cudaMalloc((void**)&Nd, size);
    err = cudaMalloc((void**)&Pd,…
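
The question's kernel is cut off, but the usual suspect for "works up to Width ~2500, then breaks" on a Windows display GPU is the WDDM watchdog (TDR), which kills any kernel running longer than about two seconds; a naive matmul on a GTX 480 crosses that threshold around such sizes. A hedged sketch (the kernel below is a generic naive matmul of my own, not the question's) showing the error checking that turns silent garbage into an explicit timeout error:

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                   \
        do {                                                              \
            cudaError_t e = (call);                                       \
            if (e != cudaSuccess) {                                       \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                        cudaGetErrorString(e));                           \
                exit(1);                                                  \
            }                                                             \
        } while (0)

    __global__ void MatMulKernel(const float *M, const float *N, float *P, int Width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < Width && col < Width) {
            float acc = 0.0f;
            for (int k = 0; k < Width; ++k)
                acc += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = acc;
        }
    }

    int main()
    {
        int Width = 4096;
        size_t size = (size_t)Width * Width * sizeof(float);  // size_t avoids int overflow
        float *Md, *Nd, *Pd;
        CHECK(cudaMalloc((void**)&Md, size));
        CHECK(cudaMalloc((void**)&Nd, size));
        CHECK(cudaMalloc((void**)&Pd, size));
        CHECK(cudaMemset(Md, 0, size));
        CHECK(cudaMemset(Nd, 0, size));

        dim3 block(16, 16);
        dim3 grid((Width + 15) / 16, (Width + 15) / 16);
        MatMulKernel<<<grid, block>>>(Md, Nd, Pd, Width);
        CHECK(cudaGetLastError());        // launch-configuration errors
        CHECK(cudaDeviceSynchronize());   // runtime errors, incl. watchdog timeout
        puts("ok");
        return 0;
    }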

How to interrupt or cancel a CUDA kernel from host code

Submitted by 北城余情 on 2019-12-19 02:03:21
Question: I am working with CUDA and I am trying to stop my kernel's work (i.e. terminate all running threads) after a certain if block is hit. How can I do that? I am really stuck here.

Answer 1: I assume you want to stop a running kernel (not a single thread). The simplest approach (and the one that I suggest) is to set up a global memory flag that is tested by the kernel. You can set the flag using cudaMemcpy() (or without it if using unified memory). Like the following:

    if (gm_flag) {…
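
The answer's snippet is cut off at the flag test. A minimal sketch of the same idea, with one substitution I should name plainly: instead of cudaMemcpy() it uses a mapped (zero-copy) host flag, so the host-side store becomes visible while the kernel is still running; the kernel name and launch shape are mine. The volatile qualifier keeps the compiler from caching the flag in a register.

    #include <cstdio>
    #include <chrono>
    #include <thread>

    __global__ void longRunning(volatile int *abortFlag)
    {
        while (!*abortFlag) {
            // ... do a chunk of work, then re-test the flag ...
        }
        // threads fall out of the loop and the kernel returns
    }

    int main()
    {
        // On older setups you may need cudaSetDeviceFlags(cudaDeviceMapHost)
        // before any other CUDA call for mapped allocations to work.
        int *h_flag = nullptr, *d_flag = nullptr;
        cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
        *h_flag = 0;
        cudaHostGetDevicePointer(&d_flag, h_flag, 0);

        longRunning<<<32, 128>>>(d_flag);

        std::this_thread::sleep_for(std::chrono::seconds(2));
        *h_flag = 1;                      // raise the flag from the host

        cudaDeviceSynchronize();          // returns once every thread exits
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFreeHost(h_flag);
        return 0;
    }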

Inter-block barrier on CUDA

Submitted by 故事扮演 on 2019-12-18 16:59:35
Question: I want to implement an inter-block barrier on CUDA, but I am encountering a serious problem. I cannot figure out why it does not work.

    #include <iostream>
    #include <cstdlib>
    #include <ctime>

    #define SIZE 10000000
    #define BLOCKS 100

    using namespace std;

    struct Barrier {
        int *count;

        __device__ void wait() {
            atomicSub(count, 1);
            while(*count)
                ;
        }

        Barrier() {
            int blocks = BLOCKS;
            cudaMalloc((void**) &count, sizeof(int));
            cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
        }

        ~Barrier() {…
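
Two failure modes are visible even in the truncated code: the plain while(*count) read is not volatile, so the compiler may cache the first load forever, and any spin barrier deadlocks outright if more blocks are launched than can be simultaneously resident on the GPU. On CUDA 9 and later the supported route is a cooperative launch with a grid-wide barrier; a hedged sketch (kernel and sizes are mine; compile with something like nvcc -arch=sm_70 -rdc=true):

    #include <cstdio>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void withGridBarrier(float *data, int n)
    {
        cg::grid_group grid = cg::this_grid();
        int stride = gridDim.x * blockDim.x;

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] = (float)i;                    // phase 1
        grid.sync();                               // barrier across ALL blocks
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] += data[(i + 1) % n];          // phase 2 reads phase-1 results
    }

    int main()
    {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        // Size the grid so every block is resident at once; this is exactly
        // the condition a hand-rolled spin barrier silently deadlocks on,
        // and the cooperative launch enforces it for you.
        int block = 256, perSM = 0;
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSM, withGridBarrier, block, 0);
        int gridSize = perSM * prop.multiProcessorCount;

        int nArg = n;
        void *args[] = { &d, &nArg };
        cudaError_t e = cudaLaunchCooperativeKernel((void*)withGridBarrier,
                                                    dim3(gridSize), dim3(block),
                                                    args, 0, 0);
        printf("launch: %s\n", cudaGetErrorString(e));
        printf("run:    %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
        cudaFree(d);
        return 0;
    }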

Equivalent of cudaGetErrorString for cuBLAS?

Submitted by 大憨熊 on 2019-12-18 16:38:34
Question: The CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error handling. I'm familiarizing myself with cuBLAS now, and I'd like to create a macro similar to CUDA_SAFE_CALL for cuBLAS. To make my macro's printouts useful, I'd like something analogous to cudaGetErrorString in cuBLAS. Is there an equivalent of…
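
Two notes, hedged: recent toolkits do ship a direct equivalent, cublasGetStatusString(cublasStatus_t) (added in the CUDA 11.x era, if memory serves), while on older toolkits the usual workaround is a hand-rolled switch over cublasStatus_t, e.g.:

    #include <cstdio>
    #include <cublas_v2.h>

    // Fallback for toolkits that predate cublasGetStatusString().
    static const char *cublasErrorString(cublasStatus_t s)
    {
        switch (s) {
            case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
            case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
            case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
            case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
            case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
            case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
            case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
            case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
            default:                             return "unknown cuBLAS status";
        }
    }

    #define CUBLAS_SAFE_CALL(call)                                        \
        do {                                                              \
            cublasStatus_t s_ = (call);                                   \
            if (s_ != CUBLAS_STATUS_SUCCESS)                              \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                        cublasErrorString(s_));                           \
        } while (0)

Typical use: CUBLAS_SAFE_CALL(cublasCreate(&handle));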

CUDA streams destruction and cudaDeviceReset

Submitted by 独自空忆成欢 on 2019-12-18 14:02:15
Question: I have implemented the following class using CUDA streams:

    class CudaStreams {
    private:
        int nStreams_;
        cudaStream_t* streams_;
        cudaStream_t active_stream_;

    public:
        // default constructor
        CudaStreams() { }

        // streams initialization
        void InitStreams(const int nStreams = 1) {
            nStreams_ = nStreams;
            // allocate and initialize an array of stream handles
            streams_ = (cudaStream_t*) malloc(nStreams_*sizeof(cudaStream_t));
            for(int i = 0; i < nStreams_; i++)
                CudaSafeCall(cudaStreamCreate(&(streams_[i])))…
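
The title suggests the pitfall: cudaDeviceReset() destroys the context, which invalidates every stream handle, so calling cudaStreamDestroy() afterwards fails. A hedged RAII sketch of the ordering rule (class name and shape are mine, not the question's):

    #include <cstdio>
    #include <vector>

    class StreamPool {
        std::vector<cudaStream_t> streams_;
    public:
        explicit StreamPool(int n = 1) : streams_(n) {
            for (auto &s : streams_) cudaStreamCreate(&s);
        }
        // Destroy streams while their context is still alive.
        ~StreamPool() {
            for (auto &s : streams_) cudaStreamDestroy(s);
        }
        cudaStream_t operator[](int i) const { return streams_[i]; }
    };

    int main()
    {
        {
            StreamPool pool(4);
            // ... launch work on pool[0] .. pool[3] ...
        }                        // destructor runs here, before any reset
        cudaDeviceReset();       // only now is tearing down the context safe
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }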

CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]

Submitted by 安稳与你 on 2019-12-18 12:05:27
Question (marked as a duplicate of "CUDA: How many concurrent threads in total?"): We have a workstation with two Nvidia Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum number of threads per multiprocessor (SM) is 1024, while the maximum number of threads per block is 512. Given that only one block can be executed on each SM at a time, why is max threads per multiprocessor double the max threads per block? How do we utilise the other 512…
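
The premise is where the answer hides: an SM can host more than one resident block at a time, so two 512-thread blocks together fill a 1024-thread multiprocessor. A small sketch that reads both limits off the device and computes the ratio:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // The per-SM limit exceeds the per-block limit precisely because
        // several blocks can be resident on one SM simultaneously.
        printf("max threads / block: %d\n", prop.maxThreadsPerBlock);
        printf("max threads / SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("blocks of %d to fill one SM: %d\n",
               prop.maxThreadsPerBlock,
               prop.maxThreadsPerMultiProcessor / prop.maxThreadsPerBlock);
        return 0;
    }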

CUDA atomic operation performance in different scenarios

Submitted by ▼魔方 西西 on 2019-12-18 11:12:28
Question: When I came across this question on SO, I was curious to know the answer, so I wrote the piece of code below to test atomic operation performance in different scenarios. The OS is Ubuntu 12.04 with CUDA 5.5 and the device is a GeForce GTX 780 (Kepler architecture). I compiled the code with the -O3 flag and for CC=3.5.

    #include <stdio.h>

    static void HandleError( cudaError_t err, const char *file, int line )
    {
        if (err != cudaSuccess) {
            printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line…
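
The excerpt stops inside the error handler, before the benchmark itself. A hedged sketch of the two extreme scenarios such a test typically compares (kernel names and sizes are mine): every thread hammering one word, which the hardware must serialize, versus each thread updating its own word, which runs at close to plain memory-write speed.

    #include <cstdio>

    __global__ void sameAddress(int *out)          // maximum contention
    {
        atomicAdd(out, 1);
    }

    __global__ void distinctAddresses(int *out)    // zero contention
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&out[i], 1);
    }

    int main()
    {
        const int blocks = 1024, threads = 256, n = blocks * threads;
        int *buf;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMemset(buf, 0, n * sizeof(int));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        float ms;

        cudaEventRecord(t0);
        sameAddress<<<blocks, threads>>>(buf);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("same address:       %.3f ms\n", ms);

        cudaEventRecord(t0);
        distinctAddresses<<<blocks, threads>>>(buf);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("distinct addresses: %.3f ms\n", ms);

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        cudaFree(buf);
        return 0;
    }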