What are efficient ways to sort arrays that have mostly a small set of duplicated elements? That is, a list like:
{ 10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10 }
Implementation in C++ based on the algorithm suggested by @Antti Huima:
#include <algorithm>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Modifies the input array into a sorted array.
// Complexity: O(n + k*log(k)), where 'k' = number of unique elements in the input array.
template <typename Datatype>
void SortArrayWithDuplicates(std::vector<Datatype>& in_seq) {
    std::unordered_map<Datatype, int> key_counts_map;
    // Count frequencies: O(n)
    for (const auto& key : in_seq)
        key_counts_map[key] += 1;
    // Sort the unique elements by inserting them into an ordered map: O(k*log(k))
    std::map<Datatype, int> key_counts_sorted_map;
    for (const auto& itr : key_counts_map)
        key_counts_sorted_map.insert(std::make_pair(itr.first, itr.second));
    auto AlwaysTrue = [](const Datatype&) -> bool { return true; };
    auto seq_itr = std::begin(in_seq);
    // Overwrite the input sequence with the sorted values: O(n)
    for (const auto& itr : key_counts_sorted_map) {
        std::replace_if(seq_itr, seq_itr + itr.second, AlwaysTrue, itr.first);
        seq_itr += itr.second;
    }
}
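For example, with the array from the question:

std::vector<int> seq{10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10};
SortArrayWithDuplicates(seq);
// seq is now {10, 10, 10, 10, 10, 55, 55, 55, 55, 999, 8851243}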
In practice, you can first iterate through the array once and use a hash table to count the number of occurrences of the individual elements (this is O(n), where n = size of the list). Then take all the unique elements and sort them (this is O(k log k), where k = number of unique elements), and then expand this back to a list of n elements in O(n) steps, recovering the counts from the hash table. If k << n, you save time.
IMO, pigeonhole sort is a good fit for such data.
I will clarify a bit: if you know that the number of unique elements in the array is reasonable and that there are a lot of duplicates, I would implement something like counting sort but make the list of "buckets" dynamic. After the first pass you will have gotten rid of the duplicates; then sort the array without duplicates with some good sorting algorithm, and finally restore the sorted array the way counting sort does, as sketched below.
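One compact way to realize that dynamic-bucket idea (the names here are mine, not from the answer) is to let a std::map keep the distinct values ordered while the counts accumulate, then rebuild the array the way counting sort does:

#include <cstddef>
#include <map>
#include <vector>

// Counting sort with a dynamic set of buckets: one ordered bucket per distinct value.
// Building the buckets is O(n*log(k)); rebuilding the array is O(n).
template <typename T>
void CountingSortDynamicBuckets(std::vector<T>& values) {
    std::map<T, std::size_t> buckets;          // distinct value -> occurrence count
    for (const T& v : values)
        ++buckets[v];
    std::size_t out = 0;
    for (const auto& bucket : buckets)         // distinct values in sorted order
        for (std::size_t i = 0; i < bucket.second; ++i)
            values[out++] = bucket.first;
}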
I would try counting sort with some mapping function. I.e., instead of using a frequency array whose size equals the range of the elements, you iterate over the array, write down the distinct elements, and use them in a mapping function into the array of frequencies.
This way the algorithm needs only one extra iteration and a mapping function, which should work in constant time (using some kind of hash table). The complexity of this approach would be O(n), which should be optimal. A rough sketch is below.
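A rough sketch of the mapping-function idea (names are mine; the answer leaves the final ordering of the k distinct values implicit, so this sketch sorts them explicitly, which stays cheap when k << n):

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Counting sort where the "mapping function" is a hash table that maps each
// distinct value to a slot in the frequency array.
template <typename T>
void CountingSortWithMapping(std::vector<T>& values) {
    std::unordered_map<T, std::size_t> slot_of;   // value -> index into freq
    std::vector<T> distinct;
    std::vector<std::size_t> freq;
    for (const T& v : values) {                   // single pass, expected O(n)
        auto it = slot_of.find(v);
        if (it == slot_of.end()) {
            it = slot_of.emplace(v, distinct.size()).first;
            distinct.push_back(v);
            freq.push_back(0);
        }
        ++freq[it->second];
    }
    std::sort(distinct.begin(), distinct.end());  // order the k distinct values
    std::size_t out = 0;
    for (const T& v : distinct)                   // expand counts back, O(n)
        for (std::size_t i = 0; i < freq[slot_of[v]]; ++i)
            values[out++] = v;
}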
Not the best algorithm, but simple:
You can put everything in a trie, and have the leaves be counters. That should take O(n*m), where n is the number of elements and m is the size of the biggest element (typically that would be a constant, but not necessarily). Then pre-order traverse the trie, outputting 'counter' copies of the current key when you hit a leaf. That should take only O(n+p), where p is the size of the trie, which should be tiny compared to n. A sketch follows below.
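A minimal sketch of this, restricted to non-negative integers and with keys zero-padded to a common decimal width so that pre-order (lexicographic) traversal matches numeric order; the padding step and all names here are my own additions, not from the answer:

#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Decimal-digit trie; a nonzero counter marks a leaf holding duplicates of one key.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 10> child{};
    std::size_t count = 0;
};

// Pre-order traversal: digits are visited in increasing order, so keys come out
// sorted; each leaf emits 'count' copies of its key.
void EmitSorted(const TrieNode& node, std::string& key, std::vector<unsigned long>& out) {
    if (node.count > 0)
        out.insert(out.end(), node.count, std::stoul(key));
    for (int d = 0; d < 10; ++d) {
        if (node.child[d]) {
            key.push_back(static_cast<char>('0' + d));
            EmitSorted(*node.child[d], key, out);
            key.pop_back();
        }
    }
}

std::vector<unsigned long> TrieSort(const std::vector<unsigned long>& values) {
    if (values.empty()) return {};
    // Zero-pad every key to the width of the largest value so that
    // lexicographic order equals numeric order.
    const std::size_t width =
        std::to_string(*std::max_element(values.begin(), values.end())).size();
    TrieNode root;
    for (unsigned long v : values) {                 // O(n*m) insertions
        std::string key = std::to_string(v);
        key.insert(0, width - key.size(), '0');
        TrieNode* node = &root;
        for (char c : key) {
            auto& next = node->child[c - '0'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        ++node->count;
    }
    std::vector<unsigned long> out;
    out.reserve(values.size());
    std::string key;
    EmitSorted(root, key, out);                      // O(n + p) traversal
    return out;
}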