I am working on the GPU algorithm which is supposed to do a lot of modular computations. Particularly, various operations on matrices in a finite field which in the long run red
A high-end Fermi GPU (e.g. a GTX 580) will likely give you the best performance among shipping cards for this. You would want all 32-bit operands to be of type "unsigned int" for best performance, as there is some additional overhead for the handling of signed divisions and modulos.
The compiler generates very efficient code for division and modulo with fixed divisor As I recall it is usually around three to five machine instructions instructions on Fermi and Kepler. You can check the generated SASS (machine code) with cuobjdump --dump-sass. You might be able to use templated functions with constant divisors if you only use a few different divisors.
You should see on the order of sixteen inlined SASS instructions being generated for the unsigned 32-bit operations with variable divisor, across Fermi and Kepler. The code is limited by the throughput of integer multiplies and for Fermi-class GPUs is competitive with hardware solutions. Somewhat reduced performance is seen on currently shipping Kepler-class GPUs due to their reduced integer multiply throughput.
[Added later, after clarification of the question:]
Unsigned 64-bit division and modulo with variable divisor on the other hand are called subroutines of about 65 instructions on Fermi and Kepler. They look close to optimal. On Fermi, this is still reasonably competitive with hardware implementations (note that 64-bit integer divisions are not exactly super fast on CPUs that provide this as a built-in instruction). Below is some code that I posted to the NVIDIA forums some time back for the kind of task described in the clarification. It avoids the expensive division, but does assume that fairly large batches of operands are sharing the same divisior. It uses double-precision arithmetic, which is especially fast on Tesla-class GPUs (as opposed to consumer cards). I only did a cursory test of the code, you might want to test this more carefully before deploying it.
// Let b, p, and A[i] be integers < 2^51
// Let N be a integer on the order of 10000
// for i from 1 to N
// A[i] <-- A[i] * b mod p
/*---- kernel arguments ----*/
unsigned long long *A;
double b, p; /* convert from unsigned long long to double before passing to kernel */
double oop; /* pass precomputed 1.0/p to kernel */
/*---- code inside kernel -----*/
double a, q, h, l, rem;
const double int_cvt_magic = 6755399441055744.0; /* 2^52+2^51 */
a = (double)A[i];
/* approximate quotient and round it to the nearest integer */
q = __fma_rn (a * b, oop, int_cvt_magic);
q = q - int_cvt_magic;
/* back-multiply, representing p*q as a double-double h:l exactly */
h = p * q;
l = __fma_rn (p, q, -h);
/* remainder is double-width product a*b minus double-double h:l */
rem = __fma_rn (a, b, -h);
rem = rem - l;
/* remainder may be negative as quotient rounded; fix if necessary */
if (rem < 0.0) rem += p;
A[i] = (unsigned long long)rem;