I am trying to implement blocked (tiled) matrix multiplication on a single processor. I have read the literature on why blocking improves memory performance, but I just want