I am looking for a library that supports array or matrix operations in CUDA device code, that is, operations that can be called from inside a __device__ or __global__ function.
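To make the request concrete, here is a minimal sketch of the kind of device-side operation meant. It is not taken from any existing library: the `Mat` type, the `matmul` helper, and the `batched_mul` kernel are hypothetical names chosen for illustration, and the fallback macros are only there so the same code also compiles as plain C++ on the host.

```cpp
// When not compiled by nvcc, define the CUDA qualifiers away so the helper
// can be built and checked as ordinary host C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical fixed-size matrix type (name and layout are assumptions).
template <int R, int C>
struct Mat {
    float a[R][C];
};

// Multiply an R x K matrix by a K x C matrix.
// Under nvcc this is callable from both host and device code.
template <int R, int K, int C>
__host__ __device__ Mat<R, C> matmul(const Mat<R, K>& x, const Mat<K, C>& y) {
    Mat<R, C> out;
    for (int i = 0; i < R; ++i) {
        for (int j = 0; j < C; ++j) {
            float s = 0.0f;
            for (int k = 0; k < K; ++k) {
                s += x.a[i][k] * y.a[k][j];
            }
            out.a[i][j] = s;
        }
    }
    return out;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread multiplies its own pair of 2x2 matrices.
__global__ void batched_mul(const Mat<2, 2>* x, const Mat<2, 2>* y,
                            Mat<2, 2>* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = matmul(x[i], y[i]);
    }
}
#endif
```

A library that already provides this sort of per-thread matrix type and its operations, usable inside a kernel, is what I am after.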