I have found that manually calculating the % operator on __int128 is significantly faster than using the compiler's built-in operator. I will show you how.
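To make the idea concrete, here is a minimal sketch of what a manual reduction could look like. It is not necessarily the method meant above. It assumes a small constant modulus (9 is picked purely for illustration), an unsigned 128-bit value, and a GCC or Clang compiler that provides `unsigned __int128`. The function names `mod9_builtin` and `mod9_manual` are hypothetical. The trick is to split the value into four 32-bit limbs and weight each limb by the residue of its power of two, so that only 64-bit arithmetic is needed.

```c
#include <stdint.h>

/* Reference version: let the compiler handle the 128-bit modulo. */
static uint64_t mod9_builtin(unsigned __int128 n) {
    return (uint64_t)(n % 9);
}

/* Manual version (illustrative, modulus 9 is an assumption):
 * n = a0 + a1*2^32 + a2*2^64 + a3*2^96, so
 * n mod 9 == (a0 + a1*4 + a2*7 + a3*1) mod 9,
 * because 2^32, 2^64 and 2^96 leave remainders 4, 7 and 1 modulo 9.
 * The weighted sum is below 13 * 2^32, so it fits in a uint64_t. */
static uint64_t mod9_manual(unsigned __int128 n) {
    uint64_t r = 0;
    r += (uint32_t)n;                        /* 2^0  mod 9 = 1 */
    r += (uint64_t)(uint32_t)(n >> 32) * 4;  /* 2^32 mod 9 = 4 */
    r += (uint64_t)(uint32_t)(n >> 64) * 7;  /* 2^64 mod 9 = 7 */
    r += (uint32_t)(n >> 96);                /* 2^96 mod 9 = 1 */
    return r % 9;                            /* single 64-bit modulo */
}
```

Both functions should agree for every input. Whether the manual version is actually faster depends on the compiler, the optimization flags, and the target, so it is worth benchmarking rather than assuming.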