I am new to CUDA and still learning how to use it. I am trying to understand how to implement a blocked GEMM using CUDA in Python.
I have the following code here, but