For your original questions:
- The code is slow because it performs a conversion from integer to
  floating point on every iteration. That is also why it speeds up as
  soon as you use an integer type for the sum variables as well: the
  float conversion disappears.
- The difference between platforms is the result of several factors.
  For example, it depends on how efficiently a platform can perform an
  int->float conversion. Furthermore, this conversion can interfere
  with processor-internal optimizations such as branch prediction,
  caches, and the instruction-level parallelism of the processor, all
  of which can have a huge influence on such calculations.
For the additional questions:
- "Surprisingly int is faster than uint_fast32_t"? What are
  sizeof(size_t) and sizeof(int) on your platform? One guess I can
  make is that both are probably 64-bit, so a cast to 32 bits can not
  only give you calculation errors but also incurs a
  different-size-casting penalty.
In general, avoid visible and hidden casts as far as possible unless they are really necessary. For example, find out which real data type is hidden behind "size_t" in your environment (gcc) and use that type for the loop variable.
In your example the squares of unsigned ints are themselves integers and can never be fractional, so it makes no sense to use double here. Stick to integer types to achieve maximum performance.