I have a program that can use either OpenCL or OpenMP for a few key bottlenecks, basically adding vectors and performing reductions.
In my case, on the CPU (an Intel i5), OpenMP takes 13 seconds whereas OpenCL takes 10 seconds.
The fastest configuration for me so far is to add the vectors with OpenCL on the GPU and do the reductions with OpenMP, which gets me down to 7 seconds. When I also do the reduction in an OpenCL kernel on the GPU, the total is 8 seconds.
So from my experience, I would say it depends on the use case and on how much you can optimize your OpenCL kernel.
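For anyone wondering what the two bottlenecks look like, here is a rough sketch of the OpenMP side in C. This is not my actual code: the function names `vec_add` and `vec_sum` and the use of `float` data are assumptions made only for illustration. Build with your compiler's OpenMP flag (e.g. `-fopenmp` for GCC); without it the pragmas are ignored and the loops just run serially.

```c
#include <stddef.h>

/* Element-wise vector add: out[i] = a[i] + b[i].
   Iterations are independent, so they can be split across threads. */
void vec_add(const float *a, const float *b, float *out, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        out[i] = a[i] + b[i];
}

/* Sum reduction: each thread accumulates a private partial sum,
   and OpenMP combines the partial sums when the loop ends. */
double vec_sum(const float *v, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i)
        total += v[i];
    return total;
}
```

The reduction is the part that is harder to write well as an OpenCL kernel, since the work-items have to combine partial results themselves, which may be why keeping that step in OpenMP came out slightly ahead for me.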