I'm writing implementation of Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) on GPU. But no sth like this - http://developer-resource.blogspot.com/2008/07/cuda-sieve-of-eratosthenes.html
Method:
- Creating n-element array with default values 0/1 (0 - prime, 1 - no) and passing it on GPU (I know that it can be done directly in kernel but it's not problem in this moment).
- Each thread in block checks multiples of a single number. Each block checks in total sqrt(n) possibilities. Each block == different interval.
- Marking multiples as 1 and passing data back to the host.
Code:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
return 0;
}
Run:
./sieve 10240000
It works correctly when n = 16, 64, 1024, 102400... but for n = 10240000 I getting incorrect result. Where is problem?
This code has a variety of problems, in my view.
You are fundamentally accessing items out of range. Consider this sequence in your kernel:
int tid = threadIdx.x + 1; int offset = blockIdx.x * blockDim.x; int number = offset + tid; cache[tid - 1] = global[number];
You (in some cases -- see below) have launched a thread array exactly equal in size to your
global
array. So what happens when the highest numbered thread runs the above code?number
=threadIdx.x+1+blockIdx.x*blockDim.x
. Thisnumber
index will be one beyond the end of your array. This is true for many possible values ofn
. This problem would have been evident to you if you had either used proper cuda error checking or had run your code withcuda-memcheck
. You should always do those things when you are having trouble with a CUDA code and also before asking for help from others.The code only has a chance of working correctly if the input
n
is a perfect square. The reason for this is contained in these lines of code (as well as dependencies in the kernel):int n = atol(argv[1]); int n_sqrt = floor(sqrt((double)n)); ... int threads = min(n_sqrt, THREADS); int blocks = n / threads;
(note that the correct function here would be
atoi
notatol
, but I digress...) Unlessn
is a perfect square, the resultantn_sqrt
will be somewhat less than the actual square root ofn
. This will lead you to compute a total thread array that is smaller than the necessary size. (It's OK if you don't believe me at this point. Run the code I will post below and input a size like 1025, then see if the number of threads * blocks is of sufficient size to cover an array of 1025.)As you've stated:
Each block checks in total sqrt(n) possibilities.
Hopefully this also points out the danger of non-perfect square
n
, but we must now ask "what ifn
is larger than the square of the largest threadblock size (1024)? The answer is that the code will not work correctly in many cases - and your chosen input of 10240000, although a perfect square, exceeds 1024^2 (1048576) and it does not work for this reason. Your algorithm (which I claim is not a Sieve of Eratosthenes) requires that each block be able to checksqrt(n)
possibilities, just as you stated in the question. When that no longer becomes possible because of the limits of threads per block, then your algorithm starts to break.
Here is a code that makes some attempt to fix issue #1 above, and at least give an explanation for the failures associated with #2 and #3:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
#define MAX 10240000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
if ((blockIdx.x != (gridDim.x-1)) || (threadIdx.x != (blockDim.x-1))){
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];}
}
int cpu_sieve(int n){
int limit = floor(sqrt(n));
int *test_arr = (int *)malloc(n*sizeof(int));
if (test_arr == NULL) return -1;
memset(test_arr, 0, n*sizeof(int));
for (int i = 2; i < limit; i++)
if (!test_arr[i]){
int j = i*i;
while (j <= n){
test_arr[j] = 1;
j += i;}}
int count = 0;
for (int i = 2; i < n; i++)
if (!test_arr[i]) count++;
return count;
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
if (argc != 2) {printf("must supply n as command line parameter\n"); return 1;}
int n = atoi(argv[1]);
if ((n < 1) || (n > MAX)) {printf("n out of range %d\n", n); return 1;}
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
printf("threads = %d, blocks = %d\n", threads, blocks);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
printf("CPU Sieve: %d\n", cpu_sieve(n));
return 0;
}
There are a couple of issues, I think, but here's a pointer to the actual problem: The sieve of Eratosthenes removes iteratively multiples of already encountered prime numbers, and you want to separate the work-load into thread-blocks, where each thread-block operates on a piece of shared memory (cache, in your example). Thread-blocks, however, are generally independent from all other thread-blocks and cannot easily communicate with one another. One example to illustrate the problem: The thread with index 0 in thread-block with index 0 removes multiples of 2. Thread blocks with index > 0 have no way to know about this.
来源:https://stackoverflow.com/questions/30962293/cuda-sieve-of-eratosthenes-division-into-parts