Custom reduction on GPU vs CPU yields different results
Question

Why am I seeing a different result on the GPU compared to a sequential reduction on the CPU?

    import numpy
    from numba import cuda
    from functools import reduce

    A = (numpy.arange(100, dtype=numpy.float64)) + 1
    cuda.reduce(lambda a, b: a + b * 20)(A)
    # result 12952749821.0
    reduce(lambda a, b: a + b * 20, A)
    # result 100981.0

    import numba
    numba.__version__
    # '0.34.0+5.g1762237'

Similar behavior occurs when using the Java Stream API to parallelize a reduction on the CPU:

    int n = 10;
    float inputArray[] = new float[n];
    ArrayList<Float