pyopencl

Effect of local_work_size on performance and why it is

Submitted by 自古美人都是妖i on 2019-12-06 12:24:27
Hello everyone... I am new to OpenCL and trying to explore more about it. What does local_work_size do in an OpenCL program, and how does it affect performance? I am working on an image-processing algorithm, and for my OpenCL kernel I used:

size_t local_item_size = 1;
size_t global_item_size = (int)(ceil((float)(D_can_width * D_can_height) / local_item_size)) * local_item_size;

// Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, &local_item_size, 0, NULL, NULL);

For the same kernel, when I changed to size_t local_item_size = 16; keeping everything
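The work-group size matters because OpenCL requires the global size to be a multiple of the local size, and because the local size decides how many work-items share one compute unit's resources. A minimal host-side sketch of the rounding the snippet above performs (round_up_global_size is a hypothetical helper name, not from the question):

```python
import math

def round_up_global_size(total_items: int, local_size: int) -> int:
    """Round total_items up to the nearest multiple of local_size,
    since OpenCL requires global_size % local_size == 0."""
    return math.ceil(total_items / local_size) * local_size

# 100 work-items with a work-group size of 16 are padded to 112;
# the kernel must then bounds-check get_global_id(0) against the
# real item count to skip the 12 padding work-items.
print(round_up_global_size(100, 16))  # -> 112
```

With local_item_size = 1 nothing is padded, but every work-group holds a single work-item, which typically leaves most of each compute unit idle; that is why raising it to 16 (or a hardware-friendly multiple such as 32 or 64) usually changes performance.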

PyOpenCL Matrix multiplication

Submitted by ∥☆過路亽.° on 2019-12-06 11:37:45
Question: I have this code for matrix multiplication using PyOpenCL. My problem is that the result is wrong for some matrices, and I don't understand why. After some research I think it's related to the global size or something like that, but I don't understand how to set those values. For example, matrices using numpy dtype = float32:

matrix 1:
[[ 0.99114645  0.09327769  0.90075564  0.8913309 ]
 [ 0.59739089  0.13906649  0.94246316  0.65673178]
 [ 0.24535166  0.68942326  0.41361505  0.5789603 ]
 [ 0.31962237  0.17714553  0
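One way to localize this kind of bug (a sketch with a hypothetical helper name, not the poster's code) is to compare the device result against numpy.dot on the host; a mismatch that appears only for sizes that are not a multiple of the work-group size usually points at the global size or a missing bounds check in the kernel:

```python
import numpy as np

def check_matmul(result_from_device, a, b, atol=1e-4):
    """Compare a device-computed product against numpy.dot.
    float32 accumulation order differs from BLAS, so compare
    with a tolerance rather than exact equality."""
    expected = a.dot(b)
    return bool(np.allclose(result_from_device, expected, atol=atol))

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
print(check_matmul(a.dot(b), a, b))  # reference vs itself trivially passes
```

Running the check across several sizes (4x4, 5x5, 16x16, 17x17, ...) quickly shows whether only the non-multiple sizes fail.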

OpenCL matrix multiplication should be faster?

Submitted by 安稳与你 on 2019-12-05 14:14:17
I'm trying to learn how to write GPU-optimized OpenCL kernels, so I took an example of matrix multiplication using square tiles in local memory. However, in the best case I got only a ~10x speedup (~50 Gflops) compared to numpy.dot() (5 Gflops; it uses BLAS). I found studies where they got speedups of >200x (>1000 Gflops): ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf I don't know what I'm doing wrong, or whether it is just because of my GPU (NVIDIA GTX 275), or because of some PyOpenCL overhead. But I also measured how long it takes just to copy the result from the GPU
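For context, Gflops figures like the ones above come from dividing the operation count of a matrix product by the wall time; a host-side sketch (measure_gflops is my name for it, and 2*n^3 counts one multiply plus one add per inner-loop step):

```python
import time
import numpy as np

def measure_gflops(n=512):
    """Rough GFLOP/s estimate of numpy.dot for an n x n float32 product."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    t0 = time.perf_counter()
    a.dot(b)
    dt = time.perf_counter() - t0
    return 2.0 * n**3 / dt / 1e9  # 2*n^3 flops in an n x n matrix product
```

A single timed run like this is noisy (take the best of several repetitions), and as the poster notes, the device-to-host copy should be timed separately from the kernel itself, otherwise transfer overhead is silently folded into the "compute" figure.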

pyopencl import error cffi.so undefined symbol

Submitted by ﹥>﹥吖頭↗ on 2019-12-04 14:43:27
I successfully installed PyOpenCL, but I am getting an import error. I am stuck here and unable to progress further. Any help would be much appreciated.

ImportError                               Traceback (most recent call last)
in ()
      5 from __future__ import division
      6 import numpy as np
----> 7 import pyopencl
      8 import pyopencl.array
      9 import math, time

/home/highschool/anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/__init__.py in ()
     32
     33 try:
---> 34     import pyopencl.cffi_cl as _cl
     35 except ImportError:
     36     import os

/home/highschool/anaconda2/lib/python2.7/site-packages/pyopencl

PyOpenCL “fatal error: CL/cl.h: No such file or directory” error during installation in Windows 8 (x64)

Submitted by 此生再无相见时 on 2019-12-04 14:41:13
After searching a lot for solutions to this problem, I found that this particular error has not been documented properly for Windows, so I have decided to post the issue along with the solution. Sorry if I am posting this in the wrong section; I hope it will help users with this PyOpenCL installation error in the future. Please note that the examples used here are for ATI Radeon GPUs that support the AMD OpenCL SDK. For other GPUs, please refer to their respective parameters and implement them as necessary. Also, do not attempt to install using pip if the installation fails.
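For reference, the "CL/cl.h: No such file or directory" error means the build cannot find the OpenCL headers, so the usual fix is to point PyOpenCL's source build at the vendor SDK before compiling (a sketch; the paths assume the AMD APP SDK's AMDAPPSDKROOT environment variable and are examples, not universal):

```shell
REM Inside the unpacked pyopencl source tree, on Windows:
python configure.py ^
    --cl-inc-dir="%AMDAPPSDKROOT%\include" ^
    --cl-lib-dir="%AMDAPPSDKROOT%\lib\x86_64"
python setup.py install
```

For NVIDIA or Intel SDKs, substitute the corresponding include and lib directories.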

Create a dynamic local array inside an OpenCL kernel

Submitted by 天大地大妈咪最大 on 2019-12-03 15:05:50
I have an OpenCL kernel that needs to process an array as multiple sub-arrays, where each sub-array's sum is saved in a local cache array. For example, imagine the following array: [[1, 2, 3, 4], [10, 30, 1, 23]]. Each work-group gets one sub-array (in the example we have 2 work-groups); each work-item processes two array indexes (for example, multiplying the value by the local_id), and the work-item result is saved in an array shared by the work-group.

__kernel void test(__global int **values, __global int *result, const int array_size) {
    __local int cache[array_size];

    // initialise
    if (get_local_id(0) == 0) {
        for (int i
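For context, OpenCL C does not allow a __local array whose size is a runtime variable, which is why a declaration like the one above fails to build. The usual fix (a sketch, not the poster's final code) is to make the local buffer a kernel argument and let the host pick its size when setting kernel arguments; in PyOpenCL that means passing pyopencl.LocalMemory(nbytes) for that argument. Note also that __global int ** is not a valid buffer argument; a flat array plus offsets is the usual layout:

```c
__kernel void test(__global const int *values,
                   __global int *result,
                   __local int *cache,      /* size chosen by the host */
                   const int array_size)
{
    int lid = get_local_id(0);

    /* initialise the shared scratch area cooperatively */
    if (lid < array_size)
        cache[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... each work-item accumulates its sub-array sum into cache ... */
}
```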

How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU?

Submitted by 。_饼干妹妹 on 2019-12-02 07:13:17
I'm using PyOpenCL to do some complex calculations. It runs fine on the CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB). I'm working on Mac OS X Lion (10.7.5). The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in more iterations), but only when run on the GPU. I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item. I simplified my OpenCL code as much as possible, and from what

Can this OpenCL code be optimized?

Submitted by 旧城冷巷雨未停 on 2019-12-01 18:04:29
I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return the 1xD vector r where r[i] = c * sum_over_j(v[j] * A[i][j] * B[i][j]). Below is what I have so far, but it runs freakishly slow. A version without the summing that returns a DxD matrix is about ten times faster. It's called from PyOpenCL, if that makes any difference. Am I doing anything wrong? Can it be optimized?

#define D 1000
...
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const
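A host-side NumPy reference is handy both for checking the kernel's output and for seeing the structure of the computation: r is just a matrix-vector product after an elementwise multiply, r = c * (A ∘ B) v, which suggests organizing the kernel so each work-group reduces one row rather than one work-item per (i, j) pair. A sketch (element_mult_ref is my name, not from the question):

```python
import numpy as np

def element_mult_ref(v, A, B, c):
    """Reference for r[i] = c * sum_over_j(v[j] * A[i][j] * B[i][j]).
    (A * B) is the elementwise (Hadamard) product; .dot(v) then
    performs the sum over j for every row i at once."""
    return c * (A * B).dot(v)
```

For example, with v = [1, 2], A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] and c = 2, the reference gives r = [58, 170], which a candidate kernel's output can be compared against.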