Some background info on the problem I am trying to speed up using CUDA:
I have a large number of small to moderate-sized linear systems, all of the same size, that I need to solve independently.
MATLAB provides a way to call the cuBLAS batched interface for GPU arrays using pagefun.
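
To make the setup concrete, here is a minimal CPU-side sketch in plain Python of what I mean by a batch of independent, same-sized systems. The names solve_one and solve_batch and the 2x2 sample data are made up for illustration. This is not how pagefun or cuBLAS implement it; it only shows the shape of the problem, where each system could be handed to a separate worker.

```python
def solve_one(a, b):
    """Solve one dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build an augmented matrix copy so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def solve_batch(mats, rhss):
    """Solve every system in the batch; no system depends on any other."""
    return [solve_one(a, b) for a, b in zip(mats, rhss)]


# Hypothetical sample batch: two independent 2x2 systems.
batch_a = [
    [[2.0, 1.0], [1.0, 3.0]],
    [[4.0, 0.0], [0.0, 2.0]],
]
batch_b = [
    [3.0, 5.0],
    [8.0, 6.0],
]
solutions = solve_batch(batch_a, batch_b)
```

Because the loop in solve_batch has no dependencies between iterations, the work maps naturally onto a batched GPU call, which is the part I am trying to speed up.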