gpu-programming | 易学教程

printf inside CUDA __global__ function

Question: I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function: __global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){ int tx = threadIdx.x; int ty = threadIdx.y; int bx = blockIdx.x; int by = blockIdx.y; float sum = 0; for( int k = 0; k < Ad.width ; ++k){ float Melement = Ad.elements[ty * Ad.width

nvidia-smi Volatile GPU-Utilization explanation?

Question: I know that nvidia-smi -l 1 will report the GPU usage every second (similar to the following). However, I would appreciate an explanation of what Volatile GPU-Util really means. Is that the number of used SMs over total SMs, or the occupancy, or something else? +-----------------------------------------------------------------------------+ | NVIDIA-SMI 367.48 Driver Version: 367.48 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M|

Controlling the index variables in C++ AMP

Question: I have just started trying C++ AMP and decided to give it a shot with the project I am currently working on. At some point, I have to build a distance matrix for the vectors I have, and I have written the code below for this: unsigned int samplesize=samplelist.size(); unsigned int vs = samplelist.front().size(); vector<double> samplevec(samplesize*vs); vector<double> distancevec(samplesize*samplesize,0); it1=samplelist.begin(); for(int i=0 ; i<samplesize; ++i){ for(int j = 0 ; j<vs ; ++j){

Some child grids not being executed with CUDA Dynamic Parallelism

Question: I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GK110). I am seeing strange behavior: for some configurations my program does not return the expected result, and the result also differs with each launch. Now I think I have found the source of my problem: it seems that some child grids (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time. I wrote a little test program to illustrate

How to run a prediction on GPU?

Question: I am using h2o4gpu, and the parameters I have set are for the h2o4gpu.solvers.xgboost.RandomForestClassifier model: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1.0, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=8, min_child_weight=1, missing=nan, n_estimators=100, n_gpus=1, n_jobs=-1, nthread=None, num_parallel_tree=1, num_round=1, objective='binary:logistic', predictor='gpu_predictor', random_state=123, reg_alpha=0, reg_lambda=1, scale_pos

F#/“Accelerator v2” DFT algorithm implementation probably incorrect

Question: I'm trying to experiment with software-defined radio concepts. Following this article, I've tried to implement a GPU-parallel Discrete Fourier Transform. I'm pretty sure I could pre-calculate 90 degrees of sin(i) and cos(i) and then just flip and repeat, rather than what I'm doing in this code, and that this would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result as I'd expect, but all 0.5 as inputs gives 78.9985886f (I'd expect a 0 result

How could we generate random numbers in CUDA C with different seed on each run?

Question: I am working on a stochastic process and want to generate a different series of random numbers in a CUDA kernel each time I run the program. This is similar to what we do in C++ by declaring seed = time(NULL) followed by srand(seed) and rand(). I can pass seeds from host to device via the kernel, but the problem with doing this is that I would have to pass an entire array of seeds into the kernel for each thread to have a different random seed each time. Is there a way I could generate random seed /

Are GPU Kepler CC3.0 processors not only pipelined architecture, but also superscalar? [closed]

Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 4 years ago. The documentation for CUDA 6.5 states: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb 5.2.3. Multiprocessor Level ... 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as

Read/Write OpenCL memory buffers on multiple GPU in a single context

Question: Assume a system with two distinct GPUs, but from the same vendor so they can be accessed from a single OpenCL Platform. Given the following simplified OpenCL code: float* someRawData; cl_device_id gpu1 = clGetDeviceIDs(0,...); cl_device_id gpu2 = clGetDeviceIDs(1,...); cl_context ctx = clCreateContext(gpu1,gpu2,...); cl_command_queue queue1 = clCreateCommandQueue(ctx,gpu1,...); cl_command_queue queue2 = clCreateCommandQueue(ctx,gpu2,...); cl_mem gpuMem = clCreateBuffer(ctx, CL_MEM_READ_WRITE,

Is there an opencl profiler for mac os X 10.8?

Question: I am trying to find the bottleneck in my OpenCL kernel. Is it possible to profile OpenCL programs on Mac OS X? I found gDebugger on http://www.gremedy.com/, but it requires 10.5 or 10.6 to run. The AMD SDK supports only Linux and Windows. Is there a profiler for Mountain Lion? Answer 1: How detailed must your profiling information be? Is it okay to use the built-in internal profiler? OpenCL queues can be created with the CL_QUEUE_PROFILING_ENABLE flag. This way you can see for each kernel you