I have spent quite a lot of time hand-optimizing low-level integer arithmetic, with some success. For instance, my subroutine for 6x6 multiplication takes 66 ticks compared to 8