Inserting multiple not-a-numbers into a std::unordered_set

问题

One of consequences of the IEEE 754 standard is the non-intuitive behavior of std::unordered_set<double>, when not-a-number elements (NANs) are inserted.

Due to the fact that NAN!=NAN, after the following sequence:

#include <iostream>
#include <cmath>
#include <unordered_set>

int main(){
    std::unordered_set<double> set;
    set.insert(NAN);
    set.insert(NAN);
    std::cout<<"Number of elements "<<set.size()<<"\n";  //there are 2 elements!
}

there are two elements in the set(see it live): NAN and NAN!

Mine main issue with this is, that when N NANs are inserted into the hash-set, they all hit the same hash-bucket and the performance of N inserts into the hash-set degenerates to the worst-case running time - O(N^2).

For an example, see the listing at the end of the question or here live - inserting NAN takes some order of magnitude more time than a "normal" floating number.

My question: is it possible (and if yes - how) to tweak std::unordered_set<double> in such a way, that there is at most one NAN-element in the set, no matter the flavor of inserted NANs (NAN, -NAN and so on)?

Listing:

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}

回答1:

From your comment in Andrews answer,

I think the problem with this solution: -NAN will have a different hash-value then NAN but for hash-function h must hold: if x==y then also h(x)==h(y)

This does hash differently, so you need to also define your own hash function if you want h(-NAN) == h(NAN) ...

(augmented from @Andrew answer)

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

struct EqualPred
{
    constexpr bool operator()(const double& lhs, const double& rhs) const
    {
        if (std::isnan(lhs) && std::isnan(rhs)) return true;
        return lhs == rhs;
    }
};

template <typename T>
struct Hash
{
    size_t operator()(const T& value) const
    {
        return  std::hash<T>()( std::isnan(value) ? NAN : value);
    }


};

std::unordered_set<double, Hash<double>, EqualPred> s;

constexpr int N=5000;
void test_insert(double value)
{

    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);  
    test_insert(-NAN);

    std::cout << s.size() << std::endl;
    //takes 0.2 s
}

Demo

回答2:

You can define your own predicate to compare the keys, and ensure NaNs compare equal for this purpose. This can be supplied as the third parameter to the std::unordered_set class template.

See the definition of EqualPred below (code copied from question and modified), and its use in declaring the unordered_set variable. I took the default for the second parameter from the documentation at https://en.cppreference.com/w/cpp/container/unordered_set

Live demo: http://coliru.stacked-crooked.com/a/7085936431e6698f

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

struct EqualPred
{
    constexpr bool operator()(const double& lhs, const double& rhs) const
    {
        if (std::isnan(lhs) && std::isnan(rhs)) return true;
        return lhs == rhs;
    }
};

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double, std::hash<double>, EqualPred> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}

It is worth noting (thanks to @ead's comment) that -NaN and +NaN may hash to different values. If you want to handle these as identical, you'll need to provide a different implementation of the second template parameter, the hash function. This should detect any NaNs and hash the same NaN each time.

来源：https://stackoverflow.com/questions/51042495/inserting-multiple-not-a-numbers-into-a-stdunordered-setdouble

标签

c++

c++11

ieee-754

Inserting multiple not-a-numbers into a std::unordered_set<double>

问题

回答1:

回答2: