I am trying to declare a variable for matrix multiplication as follows:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
I am trying to make it
extern __shared__ int buf[];
When you launch the kernel, pass the shared memory size in bytes as the third launch parameter:
kernel<<<blocks,threads,numbytes_for_shared>>>(...);
If you have multiple extern declarations of shared memory:
extern __shared__ float As[];
extern __shared__ float Bs[];
this will lead to As pointing to the same address as Bs.
You will need to keep both As and Bs inside a single 1D array:
extern __shared__ float smem[];
When launching the kernel, pass 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) as the shared memory size.
When indexing into As, use smem[y*BLOCK_SIZE+x],
and when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
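To make the indexing above concrete, here is a minimal sketch that puts two tiles in one dynamic shared buffer and checks that they do not overwrite each other. The kernel name two_tiles, the host helper two_tiles_ok, and passing the tile width as a runtime argument are assumptions made for illustration; they are not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical demo kernel: one block of tile x tile threads, one shared buffer.
__global__ void two_tiles(float *outA, float *outB, int tile)
{
    extern __shared__ float smem[];      // holds As first, then Bs
    int x = threadIdx.x, y = threadIdx.y;
    int idx = y * tile + x;

    smem[idx] = 1.0f;                    // plays the role of As[y][x]
    smem[tile * tile + idx] = 2.0f;      // plays the role of Bs[y][x]
    __syncthreads();

    outA[idx] = smem[idx];
    outB[idx] = smem[tile * tile + idx];
}

// Hypothetical host helper: true if both tiles kept their own values.
bool two_tiles_ok(int tile)
{
    size_t n = (size_t)tile * tile;
    float *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    // Third launch parameter: room for both tiles, in bytes.
    two_tiles<<<1, dim3(tile, tile), 2 * n * sizeof(float)>>>(dA, dB, tile);

    std::vector<float> hA(n, 0.0f), hB(n, 0.0f);
    cudaMemcpy(hA.data(), dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB.data(), dB, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);

    for (size_t i = 0; i < n; ++i)
        if (hA[i] != 1.0f || hB[i] != 2.0f) return false;
    return true;
}
```

Calling two_tiles_ok(16) from main should return true on a working device. If the second tile were indexed without the tile*tile offset, both outputs would read from the same locations.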
You have two choices for declaring shared memory inside a kernel - static or dynamic. I presume what you are doing at the moment looks something like this:
#define BLOCK_SIZE (16)
__global__ void sgemm0(const float *A, const float *B, float *C)
{
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
}
and you would like to be able to easily change BLOCK_SIZE.
One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:
template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
__shared__ float As[blocksize][blocksize];
}
template void sgemm1<16>(const float *, const float *, float *C);
Then you can instantiate as many different block size variants at compile time as you need.
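As a rough sketch of how the instantiated variants might be selected at run time, the host code can switch over the supported sizes. The kernel tile_bytes and the helper launch_tile_bytes are hypothetical names, and the kernel only reports the size of its static shared array rather than doing any multiplication.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the shared array size is fixed per instantiation.
template <int blocksize>
__global__ void tile_bytes(int *out)
{
    __shared__ float As[blocksize][blocksize];
    As[threadIdx.y][threadIdx.x] = 0.0f;
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        *out = (int)sizeof(As);          // bytes of static shared memory used
}

// Hypothetical host dispatcher: maps a runtime value onto the
// compile-time variants. Returns -1 for an unsupported size.
int launch_tile_bytes(int blocksize)
{
    int *d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_out, sizeof(int));
    dim3 threads(blocksize, blocksize);

    switch (blocksize) {
    case 8:  tile_bytes<8><<<1, threads>>>(d_out);  break;
    case 16: tile_bytes<16><<<1, threads>>>(d_out); break;
    case 32: tile_bytes<32><<<1, threads>>>(d_out); break;
    default: cudaFree(d_out); return -1;
    }

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Each case forces one instantiation, so only the sizes listed in the switch exist in the compiled binary.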
If you want to dynamically allocate the memory, define it like this:
__global__ void sgemm2(const float *A, const float *B, float *C)
{
extern __shared__ float As[];
}
and then add the size of the allocation as an argument to the kernel call:
size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);
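The point of the dynamic form is that the size is only known at launch. A small sketch, unrelated to matrix multiplication and using the hypothetical names reverse_block and reverse_on_gpu, shows a single block reversing an array whose length is chosen at run time. It assumes the array fits in one block (at most 1024 elements on current hardware).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one block, one thread per element.
__global__ void reverse_block(float *d, int n)
{
    extern __shared__ float s[];         // size set by the launch, not here
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();                     // all loads done before any read
    d[t] = s[n - 1 - t];
}

// Hypothetical host helper: returns the reversed copy of `in`.
// Assumes 0 < in.size() <= 1024 so one block covers the array.
std::vector<float> reverse_on_gpu(const std::vector<float> &in)
{
    int n = (int)in.size();
    size_t bytes = n * sizeof(float);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, in.data(), bytes, cudaMemcpyHostToDevice);

    reverse_block<<<1, n, bytes>>>(d, n);   // bytes = dynamic shared size

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

The same kernel binary serves any n up to the block limit, because the shared allocation is taken from the launch configuration.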
If you have multiple statically declared arrays which you wish to replace with dynamically allocated shared memory, then be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must exist within (share) that memory segment. So if you had something like:
#define BLOCK_SIZE (16)
__global__ void sgemm0(const float *A, const float *B, float *C)
{
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
}
you could replace it with:
#define BLOCK_SIZE (16)
__global__ void sgemm3(const float *A, const float *B, float *C)
{
extern __shared__ float buffer[];
float *As = &buffer[0];
float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];
}
and launch the kernel like this:
size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);
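For completeness, here is one way the partitioned buffer might be used in a full tiled multiply. This is a sketch under stated assumptions, not the original poster's kernel: the names sgemm_dyn and sgemm_host are hypothetical, the matrices are square n x n in row-major order, the tile width is passed at run time, and n is assumed to be a multiple of the tile width so no bounds checks are needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: C = A * B, n x n row-major, n a multiple of tile.
__global__ void sgemm_dyn(const float *A, const float *B, float *C,
                          int n, int tile)
{
    extern __shared__ float buffer[];
    float *As = &buffer[0];              // first tile*tile floats
    float *Bs = &buffer[tile * tile];    // next tile*tile floats

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * tile + ty;
    int col = blockIdx.x * tile + tx;
    float acc = 0.0f;

    for (int t = 0; t < n / tile; ++t) {
        As[ty * tile + tx] = A[row * n + t * tile + tx];
        Bs[ty * tile + tx] = B[(t * tile + ty) * n + col];
        __syncthreads();                 // tiles fully loaded
        for (int k = 0; k < tile; ++k)
            acc += As[ty * tile + k] * Bs[k * tile + tx];
        __syncthreads();                 // done reading before next load
    }
    C[row * n + col] = acc;
}

// Hypothetical host wrapper around the kernel above.
std::vector<float> sgemm_host(const std::vector<float> &A,
                              const std::vector<float> &B, int n, int tile)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(tile, tile), blocks(n / tile, n / tile);
    size_t smem = 2 * (size_t)tile * tile * sizeof(float);  // As + Bs
    sgemm_dyn<<<blocks, threads, smem>>>(dA, dB, dC, n, tile);

    std::vector<float> C((size_t)n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Because the arrays are now flat pointers, the 2D indexing As[ty][k] from the static version becomes As[ty * tile + k].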
All are equally valid, although I personally favour the template version because it lets the compiler apply other optimisations, such as automatic loop unrolling, that the dynamic version cannot get without extra work.
Sounds correct.
Generally in this case you'll need to malloc something.
There are two things here: first, C doesn't know about 2D arrays (it's just an array of arrays), and second, array sizes need to be compile-time constants (or something the compiler can calculate at compile time).
If you are using C99 you can declare the array size using a parameter of the function, but C99 support is... spotty at best.