We are developing a highly performance critical software in C++. There we need a concurrent hash map and implemented one. So we wrote a benchmark to figure out, how much slo
I found the reason: it is a Problem of gcc-4.7!!
With gcc-4.7
inserts: 37728
get : 2985
With gcc-4.6
inserts: 2531
get : 1565
So std::unordered_map
in gcc-4.7 is broken (or my installation, which is an installation of gcc-4.7.0 on Ubuntu - and another installation which is gcc 4.7.1 on debian testing).
I will submit a bug report.. until then: DO NOT use std::unordered_map
with gcc 4.7!
I have run your code using a 64 bit / AMD / 4 cores (2.1GHz) computer and it gave me the following results:
MinGW-W64 4.9.2:
Using std::unordered_map:
inserts: 9280
get: 3302
Using std::map:
inserts: 23946
get: 24824
VC 2015 with all the optimization flags I know:
Using std::unordered_map:
inserts: 7289
get: 1908
Using std::map:
inserts: 19222
get: 19711
I have not tested the code using GCC but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9 std::unordered_map it's still broken.
[EDIT]
So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I have the change I will be testing the code on GCC.
My answer is just to establish some kind of knowledge base to other answers.
I am guessing that you have not properly sized your unordered_map
, as Ylisar suggested. When chains grow too long in unordered_map
, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map
defaults to (smallest prime larger than) 100
.
I didn't have chrono
on my system, so I timed with times()
.
template <typename TEST>
void time_test (TEST t, const char *m) {
struct tms start;
struct tms finish;
long ticks_per_second;
times(&start);
t();
times(&finish);
ticks_per_second = sysconf(_SC_CLK_TCK);
std::cout << "elapsed: "
<< ((finish.tms_utime - start.tms_utime
+ finish.tms_stime - start.tms_stime)
/ (1.0 * ticks_per_second))
<< " " << m << std::endl;
}
I used a SIZE
of 10000000
, and had to change things a bit for my version of boost
. Also note, I pre-sized the hash table to match SIZE/DEPTH
, where DEPTH
is an estimate of the length of the bucket chain due to hash collisions.
Edit: Howard points out to me in comments that the max load factor for unordered_map
is 1
. So, the DEPTH
controls how many times the code will rehash.
#define SIZE 10000000
#define DEPTH 3
std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
std::numeric_limits<uint64_t>::max());
std::unordered_map<int, long double> map(SIZE/DEPTH);
void
test_insert () {
for (int i = 0; i < SIZE; ++i) {
map[vec[i]] = 0.0;
}
}
void
test_get () {
long double val;
for (int i = 0; i < SIZE; ++i) {
val = map[vec[i]];
}
}
int main () {
for (int i = 0; i < SIZE; ++i) {
uint64_t val = 0;
while (val == 0) {
val = dist(rng);
}
vec[i] = val;
}
time_test(test_insert, "inserts");
std::random_shuffle(vec.begin(), vec.end());
time_test(test_insert, "get");
}
Edit:
I modified the code so that I could change out DEPTH
more easily.
#ifndef DEPTH
#define DEPTH 10000000
#endif
So, by default, the worst size for the hash table is chosen.
elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1
My conclusion is that there is not much significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order of magnitude performance difference that you are observing.