Given an STL vector, output only the duplicates in sorted order, e.g.,
INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }
The algorithm is
This fixes the bugs in James McNellis's original version. I also provide in-place and out-of-place versions.
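// In-place version. Uses less memory and works for more container
// types but is slower.
// The range [first, last) must already be sorted. As written, first + 1
// and new_last - 1 require random-access iterators.
// Returns the new logical end; the caller erases [result, last).
template <typename It>
It not_unique_inplace(It first, It last)
{
    if (first == last)
        return last;

    It new_last = first;
    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        // Keep a value the first time it is seen next to an equal neighbour.
        if (*current == *next &&
            (new_last == first || *current != *(new_last - 1)))
            *new_last++ = *current;
    }
    return new_last;
}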
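// Out-of-place version. Fastest.
// The range [first, last) must already be sorted. pout points to a
// container that provides empty(), back() and push_back().
template <typename It, typename Container>
void not_unique(It first, It last, Container* pout)
{
    if (first == last || !pout)
        return;

    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        // Append a value the first time it is seen next to an equal neighbour.
        if (*current == *next &&
            (pout->empty() || *current != pout->back()))
            pout->push_back(*current);
    }
}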
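Both versions above only compare neighbouring elements, so they assume the range is already sorted. Here is a minimal sketch of the full sort / compact / erase sequence. The wrapper name sorted_duplicates is hypothetical (it is not part of the answer), and the in-place function is repeated only so the snippet compiles on its own.

```cpp
#include <algorithm>
#include <vector>

// Copy of the in-place function so this snippet is self-contained.
template <typename It>
It not_unique_inplace(It first, It last)
{
    if (first == last)
        return last;

    It new_last = first;
    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        if (*current == *next &&
            (new_last == first || *current != *(new_last - 1)))
            *new_last++ = *current;
    }
    return new_last;
}

// Hypothetical convenience wrapper (an assumption for illustration):
// sorts a copy of the input, compacts one copy of each duplicated value
// to the front, then erases the leftover tail in a single call.
std::vector<int> sorted_duplicates(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    v.erase(not_unique_inplace(v.begin(), v.end()), v.end());
    return v;
}
```

Called on the question's input { 4, 4, 1, 2, 3, 2, 3 }, the wrapper should return { 2, 3, 4 }. Taking the vector by value keeps the caller's data untouched; pass by reference instead if the copy is too expensive.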
Another one:
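#include <cstddef>
#include <set>
#include <vector>
using namespace std;

// No prior sorting needed: the sets do the ordering.
template <typename T>
void keep_duplicates(vector<T>& v)
{
    set<T>
        u(v.begin(), v.end()), // unique
        d;                     // duplicates
    for (size_t i = 0; i < v.size(); i++)
        if (u.find(v[i]) != u.end())
            u.erase(v[i]);     // first occurrence: consume it
        else
            d.insert(v[i]);    // second or later occurrence: a duplicate
    v = vector<T>(d.begin(), d.end());
}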
Calling "erase(it_start + keep, it_stop);" from within the while loop is going to result in copying the remaining elements over and over again.
I'd suggest swapping all unique elements to the front of the vector, then erasing the remaining elements all at once.
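#include <algorithm>
#include <vector>
using namespace std;

// Counts how many consecutive elements starting at curr are equal to *curr.
int num_repeats(vector<int>::const_iterator curr, vector<int>::const_iterator end) {
    if (curr == end)
        return 0;  // nothing to dereference in an empty range
    int same = *curr;
    int count = 0;
    while (curr != end && same == *curr) {
        ++curr;
        ++count;
    }
    return count;
}

void dups(vector<int> *v) {
    sort(v->begin(), v->end());
    vector<int>::iterator current = v->begin();
    vector<int>::iterator end_of_dups = v->begin();
    while (current != v->end()) {
        int n = num_repeats(current, v->end());
        if (n > 1) {
            // Move one copy of the repeated value to the front.
            swap(*end_of_dups, *current);
            end_of_dups++;
        }
        current += n;  // skip the whole run of equal values
    }
    // Erase everything after the collected duplicates in one call.
    v->erase(end_of_dups, v->end());
}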
What is meant by "as efficient as std::unique"? Efficient in terms of runtime, development time, memory usage, or what?
As others pointed out, std::unique only removes adjacent duplicates, so it needs sorted input to remove them all. You haven't provided sorted input, so it's not a fair test to begin with.
Personally I would just have a std::map do all of my work for me. It has a lot of properties we can use for maximal elegance/brevity. It keeps its elements sorted already, and operator[] will insert a zero value if the key doesn't already exist. By leveraging those properties, we can get this done in two or three lines of code, and still achieve reasonable runtime complexity.
Basically, my algorithm is this: For each element in the vector, increment by one the map entry keyed by the value of that element. Afterwards, simply walk the map, outputting any key whose value is more than 1. Couldn't be simpler.
#include <iostream>
#include <vector>
#include <map>

void
output_sorted_duplicates(std::vector<int>* v)
{
    std::map<int, int> m;

    // count how many of each element there are, putting results into map
    // map keys are elements in the vector,
    // map values are the frequency of that element
    for (std::vector<int>::iterator vb = v->begin(); vb != v->end(); ++vb)
        ++m[*vb];

    // output keys whose values are 2 or more
    // the keys are already sorted by the map
    for (std::map<int, int>::iterator mb = m.begin(); mb != m.end(); ++mb)
        if ( (*mb).second >= 2 )
            std::cout << (*mb).first << " ";
    std::cout << std::endl;
}

int main(void)
{
    int initializer[] = { 4, 4, 1, 2, 3, 2, 3 };
    std::vector<int> data(&initializer[0], &initializer[0] + 7);
    output_sorted_duplicates(&data);
}
janks@phoenix:/tmp$ g++ test.cc && ./a.out
2 3 4
So, we visit each element in your vector once, and then each element in my map once, where the number of elements in my map is at worst no bigger than your vector. Each map update costs O(log n), so the whole thing is O(n log n). The drawback to my solution is that it needs a lot more storage space than the solutions that rearrange your vector in place. The advantages, however, are clear. It's incredibly short and simple, it's obviously correct without the need for much testing or code review, and it has reasonable performance properties.
Making my function a template, and making it operate on STL-style ranges instead of just vectors of ints, is left as an exercise.