I have a single line of code that consumes 25% - 30% of the runtime of my application. It is a less-than comparator for an std::set (the set is implemented with a Red-Black-Tree).
I don't have an answer per se - just a couple of ideas:
I have many insertions for each extraction of the minimum. I thought about using Fibonacci heaps, but I have been told that they are theoretically nice, but suffer from high constants and are pretty complicated to implement. And since insert is O(log n), the runtime increase is nearly constant for large n. So I think it's okay to stick to the set.
This sounds to me like a typical priority-queue application. You say you just considered using a Fibonacci heap, so I guess such a priority-queue implementation would be sufficient for your needs (pushing elements, and extracting the min element one at a time). Before you go out of your way and obsess over optimizing one or two clock cycles out of that comparison function, I would suggest that you try a few off-the-shelf priority-queue implementations, like std::priority_queue, boost::d_ary_heap (or boost::d_ary_heap_indirect for a mutable priority-queue), or any other Boost heap structure.
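For concreteness, here is a minimal sketch of that approach with std::priority_queue; the Entry used here is a stand-in with just the two ordering fields from the question, so adapt the names and the comparator to your real struct:

#include <queue>
#include <vector>

// Stand-in for the real Entry; only the two ordering fields matter here.
struct Entry {
    float _cost;
    long  _id;
};

// Invert the comparison so the *cheapest* entry sits on top of the heap
// (std::priority_queue is a max-heap by default).
struct CostGreater {
    bool operator()(Entry const &a, Entry const &b) const {
        if (a._cost != b._cost) return a._cost > b._cost;
        return a._id > b._id;
    }
};

std::priority_queue<Entry, std::vector<Entry>, CostGreater> open;

// open.push(Entry{2.5f, 42});   // O(log n) insertion
// Entry next = open.top();      // cheapest entry
// open.pop();                   // O(log n) extract-min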
I encountered a similar situation before, when I was using a std::set in place of a priority-queue in an A*-like algorithm (I also tried a sorted std::vector with std::inplace_merge for insertions). Switching to std::priority_queue was a huge boost in performance, and later switching to boost::d_ary_heap_indirect went the extra mile. I recommend that you at least give that a try if you haven't already.
I have a hard time believing that:
a) The comparison function runs 180 million times in 30 seconds
and
b) The comparison function uses 25% of the cpu time
are both true. Even a Core 2 Duo should easily be able to run 180 million comparisons in less than a second (after all, the claim is that it can do something like 12,000 MIPS, if that actually means anything). So I'm inclined to believe that there is something else being lumped in with the comparison by the profiling software. (Allocating memory for new elements, for example.)
However, you should at least consider the possibility that a std::set is not the data structure you're looking for. If you do millions of inserts before you actually need the sorted values (or maximum value, even), then you may well be better off putting the values into a vector, which is a much cheaper data structure both in time and space, and sorting it on demand.
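A rough sketch of that idea (the Entry here is again a stand-in with only the two ordering fields):

#include <algorithm>
#include <vector>

// Insert everything cheaply, sort once when the order is actually needed.
struct Entry {
    float _cost;
    long  _id;
    bool operator<(Entry const &other) const {
        if (_cost != other._cost) return _cost < other._cost;
        return _id < other._id;
    }
};

std::vector<Entry> entries;

void add(Entry const &e) {
    entries.push_back(e);                       // amortized O(1), no comparisons
}

void finish() {
    std::sort(entries.begin(), entries.end());  // one O(n log n) sort at the end
}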
If you actually need the set because you're worried about collisions, then you might consider an unordered_set instead, which is slightly cheaper but not as cheap as a vector. (Precisely because vectors cannot guarantee you uniqueness.) But honestly, looking at that structure definition, I have a hard time believing that uniqueness is important to you.
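If uniqueness really were the requirement, a sketch along these lines would cover it; EntryHash and EntryEqual are illustrative names, and the hash combine is deliberately simple:

#include <cstddef>
#include <functional>
#include <unordered_set>

// Stand-in Entry with just the two identity fields.
struct Entry {
    float _cost;
    long  _id;
};

struct EntryEqual {
    bool operator()(Entry const &a, Entry const &b) const {
        return a._cost == b._cost && a._id == b._id;
    }
};

struct EntryHash {
    std::size_t operator()(Entry const &e) const {
        std::size_t h1 = std::hash<float>()(e._cost);
        std::size_t h2 = std::hash<long>()(e._id);
        return h1 ^ (h2 << 1);   // simple hash combine
    }
};

std::unordered_set<Entry, EntryHash, EntryEqual> entries;  // uniqueness, no ordering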
"Benchmark"
On my little Core i5 laptop, which I suppose is not in the same league as OP's machine, I ran a few tests inserting 10 million random unique Entry objects (with just the two comparison fields) into a std::set and into a std::vector. At the end of this, I sort the vector.
I did this twice; once with a random generator that produces probably unique costs, and once with a generator which produces exactly two different costs (which should make the compare slower). Ten million inserts result in slightly more comparisons than reported by OP.
            unique cost            discrete cost
          compares    time       compares    time
set      243002508   14.7s      241042920   15.6s
vector   301036818    2.0s      302225452    2.3s
In an attempt to further isolate the comparison times, I redid the vector benchmarks using both std::sort and std::partial_sort, using 10 elements (essentially a selection of top-10) and 10% of the elements (that is, one million). The results of the larger partial_sort surprised me -- who would have thought that sorting 10% of a vector would be slower than sorting all of it -- but they show that algorithm costs are a lot more significant than comparison costs:
                      unique cost            discrete cost
                    compares    time       compares    time
partial sort 10     10000598    0.6s       10000619    1.1s
partial sort 1M     77517081    2.3s       77567396    2.7s
full sort          301036818    2.0s      302225452    2.3s
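For reference, the top-10 case is essentially just this call (a sketch, with a stand-in Entry carrying only the two ordering fields):

#include <algorithm>
#include <vector>

struct Entry {
    float _cost;
    long  _id;
    bool operator<(Entry const &other) const {
        if (_cost != other._cost) return _cost < other._cost;
        return _id < other._id;
    }
};

void top_ten(std::vector<Entry> &entries) {
    // Assumes at least 10 elements. Only the first 10 positions end up
    // sorted; the rest of the range is left in unspecified order.
    std::partial_sort(entries.begin(), entries.begin() + 10, entries.end());
}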
Conclusion: The longer compare time is visible, but container manipulation dominates. The total cost of ten million set inserts is certainly visible in a total of 52 seconds of compute time. The total cost of ten million vector inserts is quite a bit less noticeable.
Small note, for what it's worth:
The one thing I got from that bit of assembly code is that you're not saving anything by making the cost a float. It's actually allocating eight bytes for the float, so you're not saving any memory, and your cpu does not do a single float comparison any faster than a single double comparison. Just sayin' (i.e., beware of premature optimization).
Let me preface this with the fact that what I'm going to outline here is fragile and not entirely portable -- but under the right circumstances (which are pretty much what you've specified) I'm reasonably certain that it should work correctly.
One point it depends upon is the fact that IEEE floating point numbers are carefully designed so that if you treat their bit pattern as an integer, they'll still sort into the correct order (modulo a few things like NaNs, for which there really is no "correct order").
To make use of that, what we do is pack the Entry so there's no padding between the two pieces that make up our key. Then we ensure the structure as a whole is aligned to an 8-byte boundary. I've also changed the _id to int32_t to ensure that it stays 32 bits, even on a 64-bit system/compiler (which will almost certainly produce the best code for this comparison).
Then, we cast the address of the structure so we can view the floating point number and the integer together as a single 64-bit integer. Since you're using a little-endian processor, to support that we need to put the less significant part (the id) first, and the more significant part (the cost) second, so when we treat them as a 64-bit integer, the floating point part will become the most significant bits, and the integer part the less significant bits:
#include <cstdint>

struct __attribute__ ((__packed__)) __attribute__((aligned(8))) Entry {
    // Do *not* reorder the following two fields or comparison will break.
    const int32_t _id;
    const float _cost;

    // some other vars

    Entry(long id, float cost) : _id(id), _cost(cost) {}
};
Then we have our ugly little comparison function:
bool operator<(Entry const &a, Entry const &b) {
    return *(int64_t const *)&a < *(int64_t const *)&b;
}
Once we've defined the struct correctly, the comparison becomes fairly straightforward: just take the first 64 bits of each struct, and compare them as if they were 64-bit integers.
Finally a bit of test code to give at least a little assurance that it works correctly for some values:
#include <iostream>

int main() {
    Entry a(1236, 1.234f), b(1234, 1.235f), c(1235, 1.235f);

    std::cout << std::boolalpha;
    std::cout << (b<a) << "\n";
    std::cout << (a<b) << "\n";
    std::cout << (b<c) << "\n";
    std::cout << (c<b) << "\n";
    return 0;
}
At least for me, that produces the expected results:
false
true
true
false
Now, some of the possible problems: if the two items get rearranged either between themselves, or any other part of the struct gets put before or between them, comparison will definitely break. Second, we're completely dependent on the sizes of the items remaining 32 bits apiece, so when they're concatenated they'll be 64 bits. Third, if somebody removes the __packed__ attribute from the struct definition, we could end up with padding between _id and _cost, again breaking the comparison. Likewise, if somebody removes the aligned(8), the code may lose some speed, because it's trying to load 8-byte quantities that aren't aligned to 8-byte boundaries (and on another processor, this might fail completely).
[Edit: Oops. @rici reminded me of something I intended to list here, but forgot: this only works correctly when both the _id and _cost are positive. If _cost is negative, comparisons will be messed up by the fact that IEEE floating point uses a sign-magnitude representation. If an _id is negative, its sign bit will be treated just like a normal bit in the middle of a number, so a negative _id will show up as larger than a positive _id.]
To summarize: this is fragile. No question at all about that. Nonetheless, it should be pretty fast -- especially if you're using a 64-bit compiler, in which case I'd expect the comparison to come out to two loads and one comparison. To make a long story short, you're at the point that you probably can't make the comparison itself any faster at all -- all you can do is try to do more in parallel, optimize memory usage patterns, etc.
An easy solution is to precompute a sort identifier composed of the cost as the most significant part and the id as the rest.
E.g.,
struct Entry
{
    double      cost_;
    long        id_;
    long long   sortingId_;

    // some other vars

    Entry( double cost, long id )
        : cost_( cost ), id_( id ), sortingId_( 1e9*100*cost + id )
    {}
};
Adjust the sortingId_ computation based on what you can assume about the value ranges. Then just sort on sortingId_.
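The comparator then collapses to a single integer comparison; a minimal sketch, assuming the Entry defined above:

#include <set>

// sortingId_ already encodes cost (high part) and id (low part).
inline bool operator<( Entry const& a, Entry const& b )
{
    return a.sortingId_ < b.sortingId_;    // one integer comparison per call
}

std::set<Entry> entries;    // the set now orders purely on sortingId_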
Or, as a variation of the same idea, if you can't make suitable assumptions about the data, consider preparing the data especially for memcmp.
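For example, one way to prepare such data is to store the key as big-endian bytes, so that memcmp yields the desired ordering. This is only a sketch, assuming non-negative costs and ids; the key_ field and its layout are illustrative:

#include <cstdint>
#include <cstring>

struct Entry
{
    double          cost_;
    long            id_;
    unsigned char   key_[12];   // 4 bytes of float cost + 8 bytes of id, big-endian

    Entry( double cost, long id )
        : cost_( cost ), id_( id )
    {
        float c = static_cast<float>( cost );
        std::uint32_t cbits;
        std::memcpy( &cbits, &c, sizeof cbits );            // raw bits of the float
        std::uint64_t ibits = static_cast<std::uint64_t>( id );
        // Most significant bytes first, so memcmp compares cost first, then id.
        for ( int i = 0; i < 4; ++i ) { key_[i]     = (cbits >> (8*(3 - i))) & 0xFF; }
        for ( int i = 0; i < 8; ++i ) { key_[4 + i] = (ibits >> (8*(7 - i))) & 0xFF; }
    }
};

inline bool operator<( Entry const& a, Entry const& b )
{
    return std::memcmp( a.key_, b.key_, sizeof a.key_ ) < 0;
}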
For a higher level solution, remember that std::set::insert has an overload with a hint argument. If your data is already in nearly sorted order, that might seriously reduce the number of calls to your comparator function.
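A sketch of the hinted form, assuming the Entry above and that new entries usually arrive in roughly ascending order, so that the end of the set is usually the right neighbourhood:

#include <set>

std::set<Entry> entries;

void push( Entry const& e )
{
    // If e really belongs just before the hint, insertion is amortized
    // O(1) and only a couple of comparator calls are needed.
    entries.insert( entries.end(), e );
}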
And you might consider whether a std::unordered_set might be sufficient, i.e. whether you actually need to list the data in sorted order, or whether the sorting is just an internal detail of std::set element insertion.
Finally, for other readers (the OP has made clear that he's aware of this), remember to MEASURE.