Is gcc std::unordered_map implementation slow? If so - why?


We are developing highly performance-critical software in C++. There we need a concurrent hash map and implemented one. So we wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.

3 Answers
  • 2020-12-02 09:08

    I found the reason: it is a problem with gcc-4.7!

    With gcc-4.7

    inserts: 37728
    get    : 2985
    

    With gcc-4.6

    inserts: 2531
    get    : 1565
    

    So std::unordered_map in gcc-4.7 is broken (or my installations are, which are gcc 4.7.0 on Ubuntu and gcc 4.7.1 on Debian testing).

    I will submit a bug report. Until then: DO NOT use std::unordered_map with gcc 4.7!

  • 2020-12-02 09:26

    I ran your code on a 64-bit AMD machine with 4 cores (2.1 GHz), and it gave me the following results:

    MinGW-W64 4.9.2:

    Using std::unordered_map:

    inserts: 9280 
    get: 3302
    

    Using std::map:

    inserts: 23946
    get: 24824
    

    VC 2015 with all the optimization flags I know:

    Using std::unordered_map:

    inserts: 7289
    get: 1908
    

    Using std::map:

    inserts: 19222 
    get: 19711
    

    I have not tested the code with GCC, but I think its performance may be comparable to that of VC, so if that is true, then GCC 4.9's std::unordered_map is still broken.

    [EDIT]

    So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I have the chance, I will test the code on GCC.

    My answer is just meant to establish some kind of knowledge base for the other answers.

  • 2020-12-02 09:28

    I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (smallest prime larger than) 100.
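
    For illustration (this is not from the original benchmark, and the element count N is just a hypothetical figure), here is a minimal sketch of pre-sizing the table versus letting it grow through repeated rehashes:

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    
    int main () {
        const std::size_t N = 10000000;   // hypothetical number of insertions
    
        std::unordered_map<uint64_t, long double> presized;
        presized.reserve(N);              // enough buckets up front, so inserting N keys never rehashes
        std::cout << "buckets after reserve: " << presized.bucket_count() << '\n';
    
        std::unordered_map<uint64_t, long double> grown;   // starts at the implementation's small default bucket count
        std::cout << "default buckets: " << grown.bucket_count() << '\n';
        // Each time an insert would push load_factor() above max_load_factor()
        // (1.0 by default), the table rehashes into a larger bucket array,
        // which is the cost discussed above.
    }

    Constructing the map with an explicit bucket count, as the benchmark code further below does with SIZE/DEPTH, has much the same effect as the reserve() call here.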

    I didn't have chrono on my system, so I timed with times().

    // Headers needed for times(), sysconf(), and std::cout.
    #include <sys/times.h>
    #include <unistd.h>
    #include <iostream>
    
    template <typename TEST>
    void time_test (TEST t, const char *m) {
        struct tms start;
        struct tms finish;
        long ticks_per_second;
    
        times(&start);      // CPU time before running the test
        t();
        times(&finish);     // CPU time after running the test
        ticks_per_second = sysconf(_SC_CLK_TCK);
        std::cout << "elapsed: "
                  << ((finish.tms_utime - start.tms_utime
                       + finish.tms_stime - start.tms_stime)
                      / (1.0 * ticks_per_second))
                  << " " << m << std::endl;
    }
    

    I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note that I pre-sized the hash table to SIZE/DEPTH buckets, where DEPTH is an estimate of the length of the bucket chains due to hash collisions.

    Edit: Howard points out to me in the comments that the max load factor for unordered_map is 1. So, DEPTH controls how many times the code will rehash (there is an illustrative sketch of this right after the listing below).

    // Headers for the benchmark itself.
    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <unordered_map>
    #include <vector>
    #include <boost/random/mersenne_twister.hpp>
    #include <boost/random/uniform_int.hpp>
    
    #define SIZE 10000000
    #define DEPTH 3
    std::vector<uint64_t> vec(SIZE);
    boost::mt19937 rng;
    boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                      std::numeric_limits<uint64_t>::max());
    // Key type matches the uint64_t values being inserted; pre-sized to SIZE/DEPTH buckets.
    std::unordered_map<uint64_t, long double> map(SIZE/DEPTH);
    
    void
    test_insert () {
        for (int i = 0; i < SIZE; ++i) {
            map[vec[i]] = 0.0;
        }
    }
    
    void
    test_get () {
        long double val;
        for (int i = 0; i < SIZE; ++i) {
            val = map[vec[i]];
        }
    }
    
    int main () {
        // Fill vec with non-zero random 64-bit keys.
        for (int i = 0; i < SIZE; ++i) {
            uint64_t val = 0;
            while (val == 0) {
                val = dist(rng);
            }
            vec[i] = val;
        }
        time_test(test_insert, "inserts");
        std::random_shuffle(vec.begin(), vec.end());
        time_test(test_get, "get");
    }
    
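
    As a side note, one way to check that DEPTH controls how often the table rehashes is to watch bucket_count() change during insertion. This is only an illustrative sketch, using sequential keys rather than the benchmark's random ones:

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    
    int main () {
        const std::size_t size = 10000000;   // same value as SIZE above
        const std::size_t depth = 3;         // same value as DEPTH above
    
        std::unordered_map<uint64_t, long double> m(size / depth);
        std::size_t buckets = m.bucket_count();
        std::size_t rehashes = 0;
    
        for (uint64_t i = 1; i <= size; ++i) {
            m[i] = 0.0;
            if (m.bucket_count() != buckets) {   // bucket array grew, so a rehash happened
                ++rehashes;
                buckets = m.bucket_count();
            }
        }
        // With max_load_factor() == 1 and bucket counts roughly doubling on each
        // rehash, the printed count comes out near log2(depth).
        std::cout << "rehashes: " << rehashes << '\n';
    }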

    Edit:

    I modified the code so that I could change out DEPTH more easily.

    #ifndef DEPTH
    #define DEPTH 10000000
    #endif
    

    So, by default, the worst size for the hash table is chosen.

    elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
    elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
    elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
    elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
    elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
    elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
    elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
    elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1
    

    My conclusion is that the initial hash table size makes no significant performance difference, except when it is made equal to the entire expected number of unique insertions. Also, I do not see the order-of-magnitude performance difference that you are observing.
