I make following this original post : PyCuda code to invert a high number of 3x3 matrixes. The code suggested as an answer is :
$ cat t14.py
import numpy as np
i
Preliminaries
First, its important to understand the arithmetic of a 3x3 matrix inversion, see here (and below).
The general methodology used for kernel design is to assign one matrix result element per thread. Therefore I will need 9 threads per matrix. Ultimately each thread will be responsible for computing one of the 9 numerical results, for each matrix. In order to compute two matrices, we then need 18 threads, 3 matrices require 27 threads.
An ancillary task is to decide threadblock/grid sizing. This follows typical methods (overall problem size determines total number of threads needed), but we will make a specific choice of 288 for threadblock size, as this is a convenient multiple of both 9 (number of threads per matrix) and 32 (number of threads per warp in CUDA), which gives us a certain measure of efficiency (no wasted threads, no gaps in data storage).
Since our thread strategy is one thread per matrix element, we must collectively solve the matrix inversion arithmetic using 9 threads. The major tasks are to compute the transposed matrix of cofactors, and then to compute the determinant, then do the final arithmetic (divide by the determinant) to compute each result element.
Computation of the cofactors
The first task is to compute the transposed matrix of cofactors of A
, called M
:
|a b c|
let A = |d e f|
|g h i|
|ei-fh ch-bi bf-ce|
M = |fg-di ai-cg cd-af|
|dh-eg bg-ah ae-bd|
We have 9 threads for this task, and nine elements of matrix M
to compute, so we will assign one thread to each element of M
. Each element of M
depends on multiple input values (a
, b
, c
, etc.) so we shall first load each input value (there are 9, one per thread), into shared memory:
// allocate enough shared memory for one element per thread in the block:
__shared__ T si[block_size];
// compute a globally unique thread index, so each thread has a unique number 0,1,2,etc.
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
// establish a temporary variable that will use and reuse during thread processing
T det = 1;
// do a thread check to make sure that our next load will be in-bounds for the input array in
if (idx < n*9)
// load one element per thread, 9 threads per matrix will load an entire matrix
det = in[idx];
// for a given matrix (9 threads) compute the base offset into shared memory, where this matrix data (9 elements) will be stored. All 9 threads have the same base offset
unsigned sibase = (threadIdx.x / 9)*9;
// for each group of 9 threads handling a matrix, compute for each thread in that group, a group offset or "lane" from 0..8, so each thread in the group has a unique identifier/assignment in the group
unsigned lane = threadIdx.x - sibase; // cheaper modulo
// let each thread place its matrix element a,b,c, etc. into shared memory
si[threadIdx.x] = det;
// shared memory is now loaded, make sure all threads have loaded before any calculations begin
__syncthreads();
now that each A
matrix element (a
, b
, c
, ...) is loaded into shared memory, we can start to compute the cofactors in M
. Let's focus on a particular thread (0) and its cofactor (ei-fh
). All of the needed matrix elements to compute this cofactor (e
, i
, f
, and h
) are now in shared memory. We need a method to load them in sequence, and perform the needed multiplications and subtractions.
At this point we observe two things:
M
element (cofactor) has a different set of 4 needed elements of A
M
element (cofactor) follows the same general arithmetic, given four arbitrary elements of A
, lets refer to them generically as X
, Y
, Z
and W
. The arithmetic is XY-ZW. I take the first element, multply it by the second, and then take the third and fourth element and multiply them together, then subtract the two products.Since the general sequence of operations (2, above) is the same for all 9 cofactors, we only need a method to arrange the loading of the 4 needed matrix elements. This methodology is encoded into the load patterns that are hard-coded into the example:
hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
There are 9 load patterns, each occupying a hexadecimal quantity, one load pattern per thread, i.e. one load pattern per M
matrix element (cofactor). Within a particular A
matrix, the matrix elements a
, b
, c
etc. are (already) loaded into shared memory at group offsets of 0, 1, 2, etc. The load pattern for a given thread will allow us to generate the sequence of group offsets, needed to retrieve the matrix elements of A
from their locations in shared memory, to be used in sequence to compute the cofactor assigned to that thread. Considering thread 0, and its cofactor ei-fh
, how does the load pattern 0x7584
encode the needed pattern to select e
, then i
, then f
, then h
?
For this we have a helper function getoff
which takes a load pattern, and successively (each time it is called) strips off an index. The first time I call getoff
with an argument of 0x7584
, it "strips off" the index 4, returns that, and replaces the 0x7584
load pattern with 0x758
for the next usage. 4 corresponds to e
. The next time I call getoff
with 0x758
it "strips off" the index 8, returns that, and replaces 0x758
with 0x75
. 8 corresponds to i
. The next time produces the index 5, corresponding to f
, and the last time produces the index 7, corresponding to h
.
With that description then we will walk through the code, pretending we are thread 0, and describe the process of computing ei-fh
:
// get the load pattern for my matrix "lane"
unsigned off = pat[lane];
//load my temporary variable `a` with the first item indexed in the load pattern:
T a = si[sibase + getoff(off)];
// multiply my temporary variable `a` with the second item indexed in the load pattern
a *= si[sibase + getoff(off)];
//load my temporary variable `b` with the third item indexed in the load pattern
T b = si[sibase + getoff(off)];
// multiply my temporary variable `b` with the fourth item indexed in the load pattern
b *= si[sibase + getoff(off)];
// compute the cofactor by subtracting the 2 products
a -= b;
sibase
, as already indicated in the first commented code section, is the base offset in shared memory where that A
matrix elements are stored. The getoff
function then adds to this base address to select the relevant input element.
Computation of the determinant
The numerical value of the determinant is given by:
det(A) = det = a(ei-fh) - b(di-fg) + c(dh-eg)
If we decompose this, we see that all terms are actually already computed:
a,b,c: these are input matrix elements, in shared locations (group offsets) 0, 1, 2
ei-fh: cofactor computed by thread 0
di-fg: cofactor computed by thread 3 (with sign reversed)
dh-eg: cofactor computed by thread 6
Now, every thread will need the value of the determinant because it will be used by each thread during computation of its final (result) element. Therefore we will have every thread in the matrix redundantly compute the same value (which is more efficient than computing it, say, in one thread, then broadcasting that value to the other threads). In order to facilitate this, we will need 3 of the already computed cofactors made available to all 9 threads. So we will select 3 (no longer needed) locations in shared memory to "publish" these values. We still need the values in locations 0, 1, 2 because we need the input matrix elements a
, b
, and c
for calculation of the determinant. But we no longer need the input elements in locations 3, 4, or 5 for the remainder of our work, so we will reuse those:
// we are about to change shared values, so wait until all previous usage is complete
__syncthreads();
// load cofactor computed by thread 0 into group offset 3 in shared
if (lane == 0) si[sibase+3] = a;
// load cofactor computed by thread 3 into group offset 4 in shared
if (lane == 3) si[sibase+4] = a;
// load cofactor computed by thread 6 into group offset 5 in shared
if (lane == 6) si[sibase+5] = a;
// make sure shared memory loads are complete
__syncthreads();
// let every thread compute the determinant (same for all threads)
// a * (ei-fh) + b * -(fg-di) + c * (dh-eg)
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
Calculation of final result
This only involves (for each thread) dividing the previously computed cofactor for that thread, by the just-computed determinant, and storing that result:
// another thread check: make sure this thread is actually doing useful work
if (idx < n*9)
// take previously computed cofactor, divide by determinant, store result
out[idx] = a / det;