pyopencl

Effect of local_work_size on performance and why it is

Submitted by 自古美人都是妖i on 2019-12-06 12:24:27
Hello everyone... I am new to OpenCL and trying to explore more about it. What does local_work_size do in an OpenCL program, and how does it affect performance? I am working on an image-processing algorithm, and for my OpenCL kernel I used:

size_t local_item_size = 1;
size_t global_item_size = (int)(ceil((float)(D_can_width * D_can_height) / local_item_size)) * local_item_size;

// Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, &local_item_size, 0, NULL, NULL);

For the same kernel, when I changed to size_t local_item_size = 16; keeping everything
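The work-group size matters because OpenCL requires the global size to be a multiple of the local size, and because the local size decides how many work-items share one compute unit's resources. A minimal host-side sketch of the rounding the snippet above performs (round_up_global_size is a hypothetical helper name, not from the question):

```python
import math

def round_up_global_size(total_items: int, local_size: int) -> int:
    """Round total_items up to the nearest multiple of local_size,
    since OpenCL requires global_size % local_size == 0."""
    return math.ceil(total_items / local_size) * local_size

# 100 work-items with a work-group size of 16 are padded to 112;
# the kernel must then bounds-check get_global_id(0) against the
# real item count to skip the 12 padding work-items.
print(round_up_global_size(100, 16))  # -> 112
```

With local_item_size = 1 nothing is padded, but every work-group holds a single work-item, which typically leaves most of each compute unit idle; that is why raising it to 16 (or a hardware-friendly multiple such as 32 or 64) usually changes performance.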

PyOpenCL Matrix multiplication

Submitted by ∥☆過路亽.° on 2019-12-06 11:37:45
Question: I have this code for matrix multiplication using PyOpenCL. My problem is that the result is wrong for some matrices, and I don't understand why. After some research I think it's related to the global size or something like that, but I don't understand how to set those values. For example, matrices using numpy dtype = float32:

matrix 1:
[[ 0.99114645  0.09327769  0.90075564  0.8913309 ]
 [ 0.59739089  0.13906649  0.94246316  0.65673178]
 [ 0.24535166  0.68942326  0.41361505  0.5789603 ]
 [ 0.31962237  0.17714553  0
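One way to localize this kind of bug (a sketch with a hypothetical helper name, not the poster's code) is to compare the device result against numpy.dot on the host; a mismatch that appears only for sizes that are not a multiple of the work-group size usually points at the global size or a missing bounds check in the kernel:

```python
import numpy as np

def check_matmul(result_from_device, a, b, atol=1e-4):
    """Compare a device-computed product against numpy.dot.
    float32 accumulation order differs from BLAS, so compare
    with a tolerance rather than exact equality."""
    expected = a.dot(b)
    return bool(np.allclose(result_from_device, expected, atol=atol))

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
print(check_matmul(a.dot(b), a, b))  # reference vs itself trivially passes
```

Running the check across several sizes (4x4, 5x5, 16x16, 17x17, ...) quickly shows whether only the non-multiple sizes fail.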

OpenCL matrix multiplication should be faster?

Submitted by 安稳与你 on 2019-12-05 14:14:17
I'm trying to learn how to write GPU-optimized OpenCL kernels, so I took an example of matrix multiplication using square tiles in local memory. However, in the best case I got only a ~10x speedup (~50 Gflops) compared to numpy.dot() (5 Gflops; it uses BLAS). I found studies where they got speedups of >200x (>1000 Gflops): ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf I don't know what I'm doing wrong, or whether it is just because of my GPU (NVIDIA GTX 275), or because of some PyOpenCL overhead. But I also measured how long it takes just to copy the result from the GPU
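For context, Gflops figures like the ones above come from dividing the operation count of a matrix product by the wall time; a host-side sketch (measure_gflops is my name for it, and 2*n^3 counts one multiply plus one add per inner-loop step):

```python
import time
import numpy as np

def measure_gflops(n=512):
    """Rough GFLOP/s estimate of numpy.dot for an n x n float32 product."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    t0 = time.perf_counter()
    a.dot(b)
    dt = time.perf_counter() - t0
    return 2.0 * n**3 / dt / 1e9  # 2*n^3 flops in an n x n matrix product
```

A single timed run like this is noisy (take the best of several repetitions), and as the poster notes, the device-to-host copy should be timed separately from the kernel itself, otherwise transfer overhead is silently folded into the "compute" figure.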

pyopencl import error cffi.so undefined symbol

Submitted by ﹥>﹥吖頭↗ on 2019-12-04 14:43:27
I successfully installed PyOpenCL, but I am getting an import error. I am stuck here and unable to progress further. Any help would be much appreciated.

ImportError                               Traceback (most recent call last)
in ()
      5 from __future__ import division
      6 import numpy as np
----> 7 import pyopencl
      8 import pyopencl.array
      9 import math, time

/home/highschool/anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/__init__.py in ()
     32
     33 try:
---> 34     import pyopencl.cffi_cl as _cl
     35 except ImportError:
     36     import os

/home/highschool/anaconda2/lib/python2.7/site-packages/pyopencl

PyOpenCL “fatal error: CL/cl.h: No such file or directory” error during installation in Windows 8 (x64)

Submitted by 此生再无相见时 on 2019-12-04 14:41:13
After searching a lot for solutions to this problem, I found that this particular error has not been documented properly for Windows, so I have decided to post the issue along with the solution. Sorry if I am posting this in the wrong section; I hope it will help users with this PyOpenCL installation error in the future. Please note that the examples used here are for ATI Radeon GPUs that support the AMD OpenCL SDK. For other GPUs, please refer to their respective parameters and implement them as necessary. Also, do not attempt to install using pip if the installation fails.
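For reference, the "CL/cl.h: No such file or directory" error means the build cannot find the OpenCL headers, so the usual fix is to point PyOpenCL's source build at the vendor SDK before compiling (a sketch; the paths assume the AMD APP SDK's AMDAPPSDKROOT environment variable and are examples, not universal):

```shell
REM Inside the unpacked pyopencl source tree, on Windows:
python configure.py ^
    --cl-inc-dir="%AMDAPPSDKROOT%\include" ^
    --cl-lib-dir="%AMDAPPSDKROOT%\lib\x86_64"
python setup.py install
```

For NVIDIA or Intel SDKs, substitute the corresponding include and lib directories.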

Create a dynamic local array inside an OpenCL kernel

Submitted by 天大地大妈咪最大 on 2019-12-03 15:05:50
I have an OpenCL kernel that needs to process an array as multiple sub-arrays, where each sub-array's sum is saved in a local cache array. For example, imagine the following array: [[1, 2, 3, 4], [10, 30, 1, 23]]. Each work-group gets one sub-array (in the example we have 2 work-groups); each work-item processes two array indexes (for example, multiplying the value by the local_id), and the work-item result is saved in an array shared by the work-group.

__kernel void test(__global int **values, __global int *result, const int array_size) {
    __local int cache[array_size];

    // initialise
    if (get_local_id(0) == 0) {
        for (int i
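For context, OpenCL C does not allow a __local array whose size is a runtime variable, which is why a declaration like the one above fails to build. The usual fix (a sketch, not the poster's final code) is to make the local buffer a kernel argument and let the host pick its size when setting kernel arguments; in PyOpenCL that means passing pyopencl.LocalMemory(nbytes) for that argument. Note also that __global int ** is not a valid buffer argument; a flat array plus offsets is the usual layout:

```c
__kernel void test(__global const int *values,
                   __global int *result,
                   __local int *cache,      /* size chosen by the host */
                   const int array_size)
{
    int lid = get_local_id(0);

    /* initialise the shared scratch area cooperatively */
    if (lid < array_size)
        cache[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... each work-item accumulates its sub-array sum into cache ... */
}
```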

How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU?

Submitted by 。_饼干妹妹 on 2019-12-02 07:13:17
I'm using PyOpenCL to do some complex calculations. It runs fine on the CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB). I'm working on Mac OS X Lion (10.7.5). The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in more iterations), but only when run on the GPU. I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item. I simplified my OpenCL code as much as possible, and from what

Can this OpenCL code be optimized?

Submitted by 旧城冷巷雨未停 on 2019-12-01 18:04:29
I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return the 1xD vector r where r[i] = c * sum_over_j(v[j] * A[i][j] * B[i][j]). Below is what I have so far, but it runs freakishly slow. A version without the summing that returns a DxD matrix is about ten times faster. It's called from PyOpenCL, if that makes any difference. Am I doing anything wrong? Can it be optimized?

#define D 1000
...
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const
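A host-side NumPy reference is handy both for checking the kernel's output and for seeing the structure of the computation: r is just a matrix-vector product after an elementwise multiply, r = c * (A ∘ B) v, which suggests organizing the kernel so each work-group reduces one row rather than one work-item per (i, j) pair. A sketch (element_mult_ref is my name, not from the question):

```python
import numpy as np

def element_mult_ref(v, A, B, c):
    """Reference for r[i] = c * sum_over_j(v[j] * A[i][j] * B[i][j]).
    (A * B) is the elementwise (Hadamard) product; .dot(v) then
    performs the sum over j for every row i at once."""
    return c * (A * B).dot(v)
```

For example, with v = [1, 2], A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] and c = 2, the reference gives r = [58, 170], which a candidate kernel's output can be compared against.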