This is quite an interesting question, so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL supercomputer from
Simply try zipping and unzipping a big file; few things load the CPU like a heavy I/O operation of that kind.
TL;DR: The accepted answer is both inefficient and incompatible. The following algorithm runs 100x faster.
The gcc compiler available on macOS can't build OpenMP code, so I had to install LLVM (brew install llvm). Even then, I didn't see CPU idle time go down while running the OMP version.
Here is a screenshot taken while the OMP version was running.
Alternatively, I used basic POSIX threads, which can be built with any C compiler, and saw almost the entire CPU used up when number of threads = number of cores = 4 (MacBook Pro, 2.3 GHz Intel Core i5). Here is the program:
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 10
#define THREAD_LOAD 100000

/* Range of integers assigned to one thread, plus its result. */
struct prime_range {
    int min;
    int max;
    int total;
};

/* Thread entry point: counts the primes in [min, max] by trial division. */
void *findPrime(void *threadarg)
{
    struct prime_range *this_range = (struct prime_range *) threadarg;
    int primes = 0;

    for (int n = this_range->min; n <= this_range->max; n++) {
        bool composite = (n < 2);   /* 0 and 1 are not prime */
        /* Only divisors up to sqrt(n) need to be checked. */
        for (int i = 2; !composite && i * i <= n; i++) {
            if (n % i == 0)
                composite = true;
        }
        if (!composite)
            primes++;
    }
    this_range->total = primes;
    return NULL;
}

int main(void)
{
    struct timespec start, finish;
    double elapsed;
    pthread_t threads[NUM_THREADS];
    struct prime_range pr[NUM_THREADS];
    int rc;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Threads are joinable by default, so no attribute object is needed.
       Arrays are indexed from 0 to NUM_THREADS - 1. */
    for (int t = 0; t < NUM_THREADS; t++) {
        pr[t].min = t * THREAD_LOAD + 1;
        pr[t].max = (t + 1) * THREAD_LOAD;
        rc = pthread_create(&threads[t], NULL, findPrime, &pr[t]);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    int totalPrimesFound = 0;
    /* Wait for every worker thread and add up the per-thread counts. */
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(threads[t], NULL);
        if (rc) {
            printf("Error: unable to join, %d\n", rc);
            exit(-1);
        }
        totalPrimesFound += pr[t].total;
    }

    clock_gettime(CLOCK_MONOTONIC, &finish);
    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("This machine calculated all %d prime numbers under %d in %lf seconds\n",
           totalPrimesFound, NUM_THREADS * THREAD_LOAD, elapsed);
    return 0;
}
Notice how the entire CPU is used up:
P.S. If you increase the number of threads, actual CPU usage goes down (try setting the number of threads to 20), as the system spends more time context switching than computing.
By the way, my machine is not as beefy as @mystical's (accepted answer), but my version with basic POSIX threading runs much faster than the OMP one. Here is the result:
P.S. Increase THREAD_LOAD to 2.5 million to see the CPU usage, because otherwise the program completes in less than a second.
For a quick improvement on one core, remove system calls to reduce context-switching. Remove these lines:
system("clear");
printf("%d prime numbers calculated\n",primes);
The first is particularly bad, as it will spawn a new process every iteration.
The reason you're getting 15% on a hex core processor is that your code uses 1 core at 100%. 100/6 = 16.67%, which, with a moving average and process scheduling (your process would be running under normal priority), could easily be reported as 15%.
Therefore, in order to use 100% cpu, you would need to use all the cores of your CPU - launch 6 parallel execution code paths for a hex core CPU and have this scale right up to however many processors your Cray machine has :)
If you want 100% CPU, you need to use more than 1 core. To do that, you need multiple threads.
Here's a parallel version using OpenMP:
I had to increase the limit to 1000000 to make it take more than 1 second on my machine.
#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1, primes = 0;

    int limit = 1000000;

    /* Iterations are handed out dynamically because large numbers take
       longer to test; each thread's count is summed by the reduction. */
#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) {
        int i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        /* The first divisor found is num itself only when num is prime. */
        if (i == num)
            primes++;
        // printf("%d prime numbers calculated\n",primes);
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n", primes, limit, runTime);

    return 0;
}
Output:
This machine calculated all 78498 prime numbers under 1000000 in 29.753 seconds
Here's your 100% CPU:
we really want to see how fast it can go!
Your algorithm to generate prime numbers is very inefficient. Compare it to primegen that generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350.
To consume all CPUs easily, you could solve an embarrassingly parallel problem, e.g., compute the Mandelbrot set or use genetic programming to paint the Mona Lisa in multiple threads (or processes).
Another approach is to take an existing benchmark program for the Cray supercomputer and port it to a modern PC.