I am working on a GPU algorithm that has to do a lot of modular computations, in particular various operations on matrices over a finite field, which in the long run red
There are tricks for performing mod operations efficiently, but only if the modulus m is a power of 2.
For instance, x mod y == x & (y-1) when y is 2^n and x is a non-negative (unsigned) integer. A single bitwise AND is about as cheap as an operation gets.
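A minimal sketch of that trick in plain C, in case it helps. The helper names is_pow2 and mod_pow2 are made up for illustration, and the code assumes 32-bit unsigned operands; it is not taken from any particular library.

```c
#include <assert.h>
#include <stdint.h>

/* True when y is a power of two, i.e. exactly one bit is set. */
static inline int is_pow2(uint32_t y) {
    return y != 0 && (y & (y - 1)) == 0;
}

/* x mod y for unsigned x. Only valid when y is a power of two:
 * y - 1 is then a mask of the low n bits, and those bits are the remainder. */
static inline uint32_t mod_pow2(uint32_t x, uint32_t y) {
    assert(is_pow2(y)); /* hypothetical precondition check, for illustration */
    return x & (y - 1);
}
```

For example, mod_pow2(13, 8) keeps the low three bits of 13 (binary 1101), which gives 5. The same expression should carry over to GPU kernel code largely unchanged, though I have not benchmarked it there.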
Otherwise, a look-up table might be an option. Below is a link to a discussion of efficient modulo implementations; you will probably need to implement one yourself to get the most out of it.
Efficient computation of mod
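To make the look-up table idea a bit more concrete, here is a rough sketch in plain C of one way it could look for a small prime field. Everything in it is an assumption for illustration: the modulus P = 7, the names mul_table, init_mul_table and mul_mod_p, and the choice to tabulate multiplication rather than something else. It only pays off when the modulus is small enough for a P x P table to fit in fast memory.

```c
#include <assert.h>
#include <stdint.h>

#define P 7u /* hypothetical small prime modulus, chosen for illustration */

/* mul_table[a][b] holds (a * b) mod P for all residues a, b in [0, P). */
static uint8_t mul_table[P][P];

/* Fill the table once up front; this is the only place % is used. */
static void init_mul_table(void) {
    for (uint32_t a = 0; a < P; ++a)
        for (uint32_t b = 0; b < P; ++b)
            mul_table[a][b] = (uint8_t)((a * b) % P);
}

/* Multiply two already-reduced residues with a single table look-up. */
static inline uint8_t mul_mod_p(uint8_t a, uint8_t b) {
    assert(a < P && b < P); /* inputs must already be reduced mod P */
    return mul_table[a][b];
}
```

After calling init_mul_table() once, mul_mod_p(3, 5) returns 1, since 15 mod 7 is 1. Whether this beats computing the remainder directly depends on the table size and the memory it lives in, so it is worth measuring before committing to it.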