I have implemented two algorithms for sorting elements from highest to lowest.
The first one takes quadratic time on the real RAM model, and the second takes O(n log(n)) time. The second one uses a priority queue to get the reduction.
Here are the timings, which are the output of the above program.
- the first column is the size of the random array of integers
- the second column is the time in seconds for the O(n^2) technique
- the third column is the time in seconds for the O(n log(n)) technique
9600 1.92663 7.58865
9800 1.93705 7.67376
10000 2.08647 8.19094
In spite of this great difference in complexity, the third column is larger than the second for the array sizes considered. Why is this so? Is the C++ priority queue implementation slow?
I executed this code on Windows 7, Visual Studio 2012 32-bit.
Here is the code:
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <algorithm>
#include <vector>
#include <queue>
#include <Windows.h>
#include <assert.h>
using namespace std;
double time_slower_sort(vector<int>& a)
{
    LARGE_INTEGER frequency, start, end;
    if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
    if (::QueryPerformanceCounter(&start) == FALSE) exit(0);
    for (size_t i = 0; i < a.size(); ++i)
    {
        vector<int>::iterator it = max_element(a.begin() + i, a.end());
        int max_value = *it;
        *it = a[i];
        a[i] = max_value;
    }
    if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
    return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
}

double time_faster_sort(vector<int>& a)
{
    LARGE_INTEGER frequency, start, end;
    if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
    if (::QueryPerformanceCounter(&start) == FALSE) exit(0);
    // Push into the priority queue. Logarithmic cost per insertion => O(n log(n)) total insertion cost
    priority_queue<int> pq;
    for (size_t i = 0; i < a.size(); ++i)
    {
        pq.push(a[i]);
    }
    // Read the elements from the priority queue in order of priority.
    // Logarithmic cost per read => O(n log(n)) reading cost for the entire vector
    for (size_t i = 0; i < a.size(); ++i)
    {
        a[i] = pq.top();
        pq.pop();
    }
    if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
    return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
}

int main(int argc, char** argv)
{
    // Iterate over vectors of different sizes and try out the two different variants
    for (size_t N = 1000; N <= 10000; N += 100)
    {
        // initialize two vectors with identical random elements
        vector<int> a(N), b(N);
        for (size_t i = 0; i < N; ++i)
        {
            a[i] = rand() % 1000;
            b[i] = a[i];
        }
        // Sort with the two different variants and time them
        cout << N << " "
             << time_slower_sort(a) << "\t\t"
             << time_faster_sort(b) << endl;
        // Sanity check
        for (size_t i = 0; i <= N - 2; ++i)
        {
            assert(a[i] == b[i]);     // both should return the same answer
            assert(a[i] >= a[i + 1]); // else not sorted
        }
    }
    return 0;
}
I think the problem is really more subtle than one would expect. In your O(N^2) solution you make no allocations: the algorithm works in place, searching for the biggest element and swapping it into the current position. This is OK.

But in the O(N log N) priority_queue version, the priority_queue internally holds a std::vector (by default) to store its state. When you push element by element, this vector sometimes needs to grow (and it does), but that is time you are not losing in the O(N^2) version. If you make the following little change in the initialization of the priority_queue:

priority_queue<int> pq(a.begin(), a.end());

instead of the for loop, the time of the O(N log N) version beats the O(N^2) one by a fair amount, as it should. In the proposed change there is still an allocation in the priority_queue version, but it happens only once (you save a lot of allocations for big vector sizes, and allocation is one of the more time-consuming operations), and the initialization might take advantage of seeing the complete state of the priority_queue in O(N) (I don't know if the STL really does this).
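One could go a step further and avoid even the copy made by the range constructor, by handing the priority_queue an existing container by move. This is a C++11 sketch of my own, not part of the quoted answer; the constructor builds the heap over the adopted storage, so no per-push reallocation happens at all:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Sketch (an assumption on my part, not the quoted answer's code):
// move an existing vector into the priority_queue instead of copying
// a range. The (comparator, container) constructor heapifies the
// adopted storage in place.
std::priority_queue<int> make_pq(std::vector<int>&& v)
{
    return std::priority_queue<int>(std::less<int>(), std::move(v));
}
```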
Sample Code (for compile and run):
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <algorithm>
#include <vector>
#include <queue>
#include <Windows.h>
#include <assert.h>
using namespace std;
double time_slower_sort(vector<int>& a) {
    LARGE_INTEGER frequency, start, end;
    if (::QueryPerformanceFrequency(&frequency) == FALSE)
        exit(0);
    if (::QueryPerformanceCounter(&start) == FALSE)
        exit(0);
    for (size_t i = 0; i < a.size(); ++i) {
        vector<int>::iterator it = max_element(a.begin() + i, a.end());
        int max_value = *it;
        *it = a[i];
        a[i] = max_value;
    }
    if (::QueryPerformanceCounter(&end) == FALSE)
        exit(0);
    return static_cast<double>(end.QuadPart - start.QuadPart) /
           frequency.QuadPart;
}

double time_faster_sort(vector<int>& a) {
    LARGE_INTEGER frequency, start, end;
    if (::QueryPerformanceFrequency(&frequency) == FALSE)
        exit(0);
    if (::QueryPerformanceCounter(&start) == FALSE)
        exit(0);
    // Push into the priority queue. Logarithmic cost per insertion
    // => O(n log(n)) total insertion cost
    priority_queue<int> pq(a.begin(), a.end()); // <----- THE ONLY CHANGE IS HERE
    // Read the elements from the priority queue in order of priority.
    // Logarithmic cost per read => O(n log(n)) reading cost for the
    // entire vector
    for (size_t i = 0; i < a.size(); ++i) {
        a[i] = pq.top();
        pq.pop();
    }
    if (::QueryPerformanceCounter(&end) == FALSE)
        exit(0);
    return static_cast<double>(end.QuadPart - start.QuadPart) /
           frequency.QuadPart;
}

int main(int argc, char** argv) {
    // Iterate over vectors of different sizes and try out the two
    // different variants
    for (size_t N = 1000; N <= 10000; N += 100) {
        // initialize two vectors with identical random elements
        vector<int> a(N), b(N);
        for (size_t i = 0; i < N; ++i) {
            a[i] = rand() % 1000;
            b[i] = a[i];
        }
        // Sort with the two different variants and time them
        cout << N << " " << time_slower_sort(a) << "\t\t"
             << time_faster_sort(b) << endl;
        // Sanity check
        for (size_t i = 0; i <= N - 2; ++i) {
            assert(a[i] == b[i]);     // both should return the same answer
            assert(a[i] >= a[i + 1]); // else not sorted
        }
    }
    return 0;
}
On my PC (Core 2 Duo 6300) the output obtained is:
1100 0.000753738 0.000110263
1200 0.000883201 0.000115749
1300 0.00103077 0.000124526
1400 0.00126994 0.000250698
...
9500 0.0497966 0.00114377
9600 0.051173 0.00123429
9700 0.052551 0.00115804
9800 0.0533245 0.00117614
9900 0.0555007 0.00119205
10000 0.0552341 0.00120466
the 3rd column is larger than the second for the array sizes considered.
The "Big O" notation only tells you how the time grows with the input size.
Your times are (or should be)
A + B*N^2 for the quadratic case,
C + D*N*LOG(N) for the linearithmic case.
But it is entirely possible that C is much larger than A, leading to a higher execution time for the linearithmic code when N is small enough.
What makes linearithmicity interesting is that if your input grew from 9600 to 19200 (doubling), the quadratic algorithm's execution time should approximately quadruple, going to about eight seconds, while the linearithmic algorithm should do little more than double its execution time.
So the execution time ratio would go from 2:8 to 8:16, i.e. the quadratic algorithm is now only twice as fast.
Double again the size of the input and 8:16 becomes 32:32; the two algorithms are equally fast when confronted with an input of around 40,000.
When tackling an input size of 80,000, the ratio is reversed: four times 32 is 128, while twice 32 is only 64. 128:64 means that the linearithmic algorithm is now twice as fast as the other.
You should run tests with very different sizes, maybe N, 2*N and 4*N, in order to get a better estimate of your A, B, C and D constants.
What this all boils down to is, do not blindly rely on the Big O classification. Use it if you expect your input to grow large; but for small inputs, it may well be that a less scalable algorithm turns out to be more efficient.
For example, imagine a comparison where, for small input sizes, the faster algorithm is one running in exponential time, hundreds of times faster than the logarithmic one. But as soon as the input size grows past nine, the exponential algorithm's running time skyrockets, while the other's doesn't.

You might even decide to implement both versions of an algorithm, and use one or the other depending on input size. There are some recursive algorithms that do exactly this, switching to iterative implementations for the last iterations. In such a case you could implement the best algorithm for each size range; but the best compromise is to go with two algorithms only, the quadratic one until N = 15, and switch to the logarithmic one afterwards.
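The size-switching idea can be sketched in code. The threshold of 15 is the figure quoted above, so treat it as a placeholder to be measured for your own two algorithms, not a universal constant:

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Sketch of a hybrid descending sort: quadratic selection sort below a
// measured crossover point, heap-based O(N log N) above it.
void hybrid_sort_desc(std::vector<int>& a)
{
    const std::size_t threshold = 15; // hypothetical crossover point
    if (a.size() <= threshold) {
        // small input: selection sort, cheap constants, no allocation
        for (std::size_t i = 0; i < a.size(); ++i)
            std::iter_swap(a.begin() + i,
                           std::max_element(a.begin() + i, a.end()));
    } else {
        // large input: build a heap once, then pop in descending order
        std::priority_queue<int> pq(a.begin(), a.end());
        for (std::size_t i = 0; i < a.size(); ++i) {
            a[i] = pq.top();
            pq.pop();
        }
    }
}
```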
I found here a reference to Introsort, which
is a sorting algorithm that initially uses Quicksort, but switches to Heapsort when the recursion depth exceeds a level based on the logarithm of the number of elements being sorted, and uses Insertion sort for small cases because of its good locality of reference, i.e. when the data most likely resides in memory and is easily referenced.
In the above case, the Insertion sort exploits memory locality, which means that its B constant is very small; a recursive algorithm is likely to incur higher costs, and have a significant value of C. So for small datasets, more compact algorithms do well even if their Big O classification is poorer.
You have an O(N^2) algorithm running 4 times faster than an O(N log N) algorithm. Or at least you think you do.
The obvious thing to do would be to verify your assumption. There's not much you can conclude from sizes 9600, 9800 and 10000. Try sizes 1000, 2000, 4000, 8000, 16000, 32000. Does the first algorithm increase the time by a factor of 4 each time, as it should? Does the second algorithm increase the time by a factor slightly larger than 2 each time, as it should?
If yes, then O(N^2) and O(N log N) look right, but the second one has massive constant factors. If no, then your assumptions about the execution speeds are wrong, and you start investigating why. O(N log N) taking 4 times longer than O(N^2) at N = 10,000 would be highly unusual and looks very suspicious.
Visual Studio must have extreme overhead for non-optimized / debug-level std:: code, in particular the priority queue class. Check out @msandiford's comment.
I tested your program with g++, first with no optimizations:
9800 1.42229 0.014159
9900 1.45233 0.014341
10000 1.48106 0.014606
Notice my vector times are close to yours. The priority queue times, on the other hand, are orders of magnitude smaller. This would suggest a debug-friendly but really slow implementation of the priority queue, contributing in a huge part to the constant mentioned by hot-licks in the comments.
Then with -O3, full optimization (close to your Release mode):
1000 0.000837 7.4e-05
9800 0.077041 0.000754
9900 0.078601 0.000762
10000 0.080205 0.000771
Now, to see whether this is reasonable, you can use the simple formula for the complexity:

time = k * N * N          // measured: 0.0008 s at N = 1000
k = 8E-10

Calculate for N = 10000:

time = k * 10000 * 10000  // which conveniently gives
time = 0.08
A perfectly cromulent result, consistent with O(N^2) and a good implementation. The same check can of course be made for the O(N log N) part.
Source: https://stackoverflow.com/questions/25613533/why-is-my-n-logn-heapsort-slower-than-my-n2-selection-sort