I know it sound weird, but here is my scenario:
I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only needs the diagonal elements to be evaluated f
Yes it can.
"The language interface and Device Runtime API available in CUDA C/C++ is a subset of the CUDA Runtime API available on the Host. The syntax and semantics of the CUDA Runtime API have been retained on the device in order to facilitate ease of code reuse for API routines that may run in either the host or device environments. A kernel can also call GPU libraries such as CUBLAS directly without needing to return to the CPU." Source
Here you can see and Matrix-Vector Multiplication using cuda and CUBLAS library function cublasSgemv.