I am creating a routine that calls the Sum Reduction kernel of Nvidia (reduction6), but when I compare the results between the CPU and GPU get an error that increases as the vec
Floating point addition is not necessarily associative.
This means that when you change the order of operations of your floating-point summation, you may get different results. Parallelizing a summation by definition changes the order of operations of the summation.
There are many ways to sum floating-point numbers, and each has accuracy benefits for different input distributions. Here's a decent survey.
Sequential summation in the given order is rarely the most accurate way to sum, so if that is what you are comparing against, don't expect it to compare well to the tree-based summation used in a typical parallel reduction.