In my application I have the following requirements -
The data structure will be populated just once with some values (not key/value pairs). The values may be r
If building the data structure does not factor into your performance concerns (or only marginally), consider saving your data into a std::vector: there's nothing beating it.
For speeding up the initial building of the data structure, you might first insert into a std::unordered_set, or at least use one for checking existence before insertion. In the second case it need not contain the elements themselves, but could contain e.g. indices:
#include <functional>     // std::hash
#include <unordered_set>
#include <vector>
std::vector<T> v;         // T: your element type
// Hash and compare indices into v, so the set stores only size_t indices.
auto h = [&v](size_t i){ return std::hash<T>()(v[i]); };
auto c = [&v](size_t a, size_t b){ return v[a] == v[b]; };
std::unordered_set<size_t, decltype(h), decltype(c)> tester(0, h, c);
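As a self-contained usage sketch (my addition, not part of the original answer; reading std::string values from std::cin is only an assumption for illustration), the tester can be used to skip duplicates while filling v:
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main()
{
    using T = std::string;  // assumption: any hashable, equality-comparable type

    std::vector<T> v;
    auto h = [&v](size_t i){ return std::hash<T>()(v[i]); };
    auto c = [&v](size_t a, size_t b){ return v[a] == v[b]; };
    std::unordered_set<size_t, decltype(h), decltype(c)> tester(0, h, c);

    for (T value; std::cin >> value; )
    {
        v.push_back(value);                       // tentatively append
        if (!tester.insert(v.size() - 1).second)  // equal element already stored?
            v.pop_back();                         // duplicate: undo the append
    }
    // v now holds each distinct value once, in first-seen order.
}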
There are several approaches.
- A std::unordered_set, which has the fastest O(1) lookup/insertion and O(N) iteration (as has every container). If your data changes a lot, or requires a lot of random lookups, this is probably the fastest. But test.
- An O(N) copy to a std::vector, to gain from its contiguous memory layout when you then iterate over the data 100s of times. Test whether this is faster than a regular std::unordered_set.
- A boost::flat_set, which offers a std::set interface with a std::vector storage back-end (i.e. a contiguous memory layout that is very cache- and prefetch-friendly). Again, test whether this gives a speed-up over the other two solutions (a small usage sketch follows after the note below).
For the last solution, see the Boost documentation for some of the tradeoffs (it's good to be aware of all the other issues like iterator invalidation, move semantics and exception safety as well):
Boost.Container flat_[multi]map/set containers are ordered-vector based associative containers based on Austern's and Alexandrescu's guidelines. These ordered vector containers have also benefited recently with the addition of move semantics to C++, speeding up insertion and erasure times considerably. Flat associative containers have the following attributes:
- Faster lookup than standard associative containers
- Much faster iteration than standard associative containers
- Less memory consumption for small objects (and for big objects if shrink_to_fit is used)
- Improved cache performance (data is stored in contiguous memory)
- Non-stable iterators (iterators are invalidated when inserting and erasing elements)
- Non-copyable and non-movable value types can't be stored
- Weaker exception safety than standard associative containers (copy/move constructors can throw when shifting values in erasures and insertions)
- Slower insertion and erasure than standard associative containers (especially for non-movable types)
NOTE: "faster lookup" means that a flat_set does its O(log N) search on contiguous memory rather than the O(log N) pointer chasing of a regular std::set. Of course, a std::unordered_set does O(1) lookup, which will be faster for large N.
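As a minimal sketch of the flat_set approach (my addition, assuming Boost.Container is available; the values are arbitrary):
#include <boost/container/flat_set.hpp>
#include <iostream>

int main()
{
    // flat_set keeps its elements sorted in one contiguous buffer.
    boost::container::flat_set<int> s;
    s.reserve(4);            // optional: avoid reallocations during bulk insertion
    s.insert({7, 3, 9, 3});  // duplicates are dropped, storage stays sorted

    for (int x : s)          // iteration walks contiguous, sorted memory
        std::cout << x << ' ';               // prints: 3 7 9
    std::cout << '\n';

    std::cout << (s.count(7) == 1) << '\n';  // O(log N) lookup, prints: 1
}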
I'd suggest you use either set or unordered_set for "filtration", and when you are done, move the data into a vector of fixed size. A sketch of that flow follows.
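A minimal sketch of that idea (my addition; filter_and_pack and the use of int are hypothetical, not from the answer):
#include <algorithm>
#include <unordered_set>
#include <vector>

// Filter duplicates through an unordered_set, then move the survivors
// into a right-sized vector (optionally sorted for later binary searches).
std::vector<int> filter_and_pack(const std::vector<int>& raw)
{
    std::unordered_set<int> filter(raw.begin(), raw.end());  // "filtration": drops duplicates
    std::vector<int> packed(filter.begin(), filter.end());   // fixed-size result
    std::sort(packed.begin(), packed.end());                 // optional
    return packed;
}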
I highly recommend you not use set or unordered_set in this case. set is a binary tree and unordered_set is a hash table, so they use a lot of memory, and they have slow iteration speed and bad locality of reference. If you have to insert/remove/find data frequently, set or unordered_set is a good choice, but here you just need to read, store and sort the data once, and then use it many times.
In this case, a sorted vector can be a good choice. vector is a dynamic array, so it has low overhead.
Just look at the code directly:
#include <algorithm>  // std::sort, std::unique, std::lower_bound, ...
#include <iostream>
#include <vector>

std::vector<int> data;
int input;
for (int i = 0; i < 10; i++)
{
    std::cin >> input;
    data.push_back(input);               // store data
}
std::sort(data.begin(), data.end());     // sort data
That's all. All your data is ready.
If you need to remove duplicates like set does, just use unique followed by erase after sorting.
data.erase(
std::unique(data.begin(), data.end()),
data.end()
);
Notice that you should use lower_bound, upper_bound and equal_range rather than find or find_if, to get the benefits of the sorted data.
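For example (my addition, continuing the snippet above; the key 42 is arbitrary), the sorted data can be queried like this:
// Binary-search queries on the sorted `data` from above.
auto range    = std::equal_range(data.begin(), data.end(), 42);
bool present  = (range.first != range.second);                    // is 42 stored at all?
auto copies   = range.second - range.first;                       // how many equal elements
auto first_ge = std::lower_bound(data.begin(), data.end(), 42);   // first element >= 42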
An unordered-set uses a hash table to provide near O(1) time searching. This is done by hashing the key to compute the offset of the element you are seeking from the beginning of the data store. Unless your dataset is small (like chars), different keys may produce the same hash (a collision).
To minimize collisions, an unordered-set has to keep its data store fairly sparsely populated. This means that finding a key will mostly be O(1) time (unless there is a collision).
However, when iterating through a hash table, the iterator will encounter a lot of unused space in the data store, which slows down finding the next element. We could link adjacent elements in the hash table with extra pointers, but I do not think an unordered-set does so.
In light of the above, I suggest you use a sorted vector for your "set". Using bisection you can search the store in O(log n) time, and iterating through the list is trivial. A vector has the added advantage that its memory is contiguous, so you are less likely to experience cache misses.
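A minimal sketch of that "sorted vector as a set" idea (my addition; the contents and the key 17 are only illustrative):
#include <algorithm>
#include <vector>

std::vector<int> values = {3, 17, 42, 99};  // kept sorted, duplicates already removed
bool is_member = std::binary_search(values.begin(), values.end(), 17);  // O(log n) membership test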