This is a program for matrix multiplication on the CUDA architecture. The code works fine when the size of the array is 30 x 30, but it outputs a series of 0's when the size is greate
You probably have a max of 1024 threads per block on your GPU. 30 x 30 = 900, so that should be OK, but e.g. 40 x 40 = 1600 would result in a kernel launch failure (take-home message: always check for errors!).
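As a minimal sketch of what "check for errors" can look like after a launch: the snippet below launches a single block with a given thread count and reports any failure. The names dummy_kernel and launch_checked are hypothetical and stand in for the real kernel and host code, which the question does not show in full.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real matrix_multiply.
__global__ void dummy_kernel(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out = 1;
    }
}

// Launches one block with the requested number of threads and returns
// the first CUDA error encountered (cudaSuccess if there was none).
cudaError_t launch_checked(int threads_per_block)
{
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, sizeof(int));
    if (err != cudaSuccess) {
        return err;
    }

    dummy_kernel<<<1, threads_per_block>>>(d_out);

    // cudaGetLastError reports launch problems such as an invalid block size.
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaDeviceSynchronize reports errors raised while the kernel ran.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return err;
}
```

On a GPU with a 1024-thread block limit, launch_checked(30 * 30) should succeed, while launch_checked(40 * 40) should return an error instead of silently leaving the output untouched.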
You probably want to consider organizing your data differently, e.g. SIZE blocks of SIZE threads, and then call the kernel as:
matrix_multiply<<<SIZE, SIZE>>>(c_input1,c_input2,c_result,SIZE);
(Obviously you'll need to modify your array indexing within the kernel code, e.g. use the block index as the row and the thread index as the column.)
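The indexing change described above might look roughly like the following sketch. It assumes square, row-major float matrices and SIZE no larger than the per-block thread limit; the original kernel body and element type are not shown, so those are assumptions, and the host helper multiply_on_gpu is a hypothetical name added only to make the example complete.

```cuda
#include <cuda_runtime.h>

// One block per row, one thread per column.
// Assumes row-major n x n matrices and n <= max threads per block.
__global__ void matrix_multiply(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.x;   // block index selects the row
    int col = threadIdx.x;  // thread index selects the column
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

// Hypothetical host helper: copies inputs to the device, launches n blocks
// of n threads, and copies the result back. Returns the first CUDA error.
cudaError_t multiply_on_gpu(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = sizeof(float) * n * n;
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;

    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        matrix_multiply<<<n, n>>>(d_a, d_b, d_c, n);
        err = cudaGetLastError();  // detect an invalid launch configuration
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return err;
}
```

With this layout the thread count per block grows with SIZE rather than SIZE * SIZE, so a 40 x 40 matrix needs only 40 threads per block.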
You are invoking the kernel with a launch configuration of 1 block of SIZE * SIZE threads (30 x 30 = 900 threads in the working case):
matrix_multiply<<<1, SIZE * SIZE>>>(c_input1,c_input2,c_result,SIZE);
A single block is limited in how many threads it can hold, so this configuration does not provide enough threads to process larger matrices.
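To see the actual limit on a given GPU rather than guessing, the device properties can be queried. This is a small sketch assuming device 0 is the GPU in use; the helper names max_threads_per_block and fits_in_one_block are hypothetical.

```cuda
#include <cuda_runtime.h>

// Returns the per-block thread limit of device 0, or -1 if the query fails.
int max_threads_per_block()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerBlock;
}

// True if a single block of `threads` threads can be launched on device 0.
bool fits_in_one_block(int threads)
{
    int limit = max_threads_per_block();
    return limit > 0 && threads > 0 && threads <= limit;
}
```

Calling fits_in_one_block(SIZE * SIZE) before the launch shows whether the <<<1, SIZE * SIZE>>> configuration can work for a given SIZE.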