Some background info on the problem I am trying to speed up using CUDA:
I have a large number of small/moderate same-sized linear systems I need to solve independent
Try using two or more parallel streams (with one linear system each) on the GPU, possibly this helps utilizing a bigger part of the GPU.
For timing measurments and hardware utilization use the visual profiler instead of CPU time measurements.
Another point is, that the GTX (consumer) GPUs perform pretty bad on double preision. If you have the chance, try to use a Tesla GPU instead.