How to keep only duplicates efficiently?

闹比i 2021-01-04 06:14

Given an STL vector, output only the duplicates in sorted order, e.g.,

INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }

The algorithm is

10 answers
  • 2021-01-04 06:43

    I miserably failed with my first attempt, assuming that std::unique moves all the duplicates to the end of the range (it doesn't). Oops. Here's another attempt:

    Here is an implementation of not_unique. It removes any elements that appear only once in the sorted range and duplicates of any elements that appear more than once. The resulting range is therefore the unique list of elements that appear more than once.

    The function has linear complexity and makes a single pass over the sorted range (std::unique is linear as well). The iterator type It must meet the requirements of a forward iterator. The function returns the end of the resulting range.

    template <typename It>
    It not_unique(It first, It last)
    {
        if (first == last) { return last; }
    
        It new_last = first;
        for (It current = first, next = ++first; next != last; ++current, ++next)
        {
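            if (*current == *next)
            {
                // Keep one copy of the repeated value.
                if (current == new_last)
                {
                    ++new_last;
                }
                else
                {
                    *new_last++ = *current;
                }
                // Skip the rest of the run so that it is counted only once,
                // even when the value appears three or more times.
                while (next != last && *current == *next)
                {
                    ++current;
                    ++next;
                }
                if (next == last)
                    return new_last;
            }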
        }
        return new_last;
    }
    
  • 2021-01-04 06:46

    Shorter and more STL-ish than previous entries. Assumes sorted input.

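    #include <algorithm>
    #include <functional>  // std::equal_to, std::not2 (std::not2 was removed in C++20)
    #include <iterator>    // std::iterator_traits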
    
    template< class I, class P >
    I remove_unique( I first, I last, P pred = P() ) {
        I dest = first;
        while (
            ( first = std::adjacent_find( first, last, pred ) )
                != last ) {
            * dest = * first;
            ++ first;
            ++ dest;
            if ( ( first = std::adjacent_find( first, last, std::not2( pred ) ) )
                == last ) break;
            ++ first;
        }
        return dest;
    }
    
    template< class I >
    I remove_unique( I first, I last ) {
        return remove_unique( first, last,
            std::equal_to< typename std::iterator_traits<I>::value_type >() );
    }
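
    Since std::not2 was deprecated in C++17 and removed in C++20, and only accepts predicates that expose the old first_argument_type / second_argument_type typedefs, the same loop can be sketched for C++11 and later with the negation written as a lambda. The name remove_unique_cxx11 is hypothetical, and sorted input is still assumed:

    ```cpp
    #include <algorithm>
    #include <functional>
    #include <iterator>

    // Hypothetical C++11 variant of remove_unique; the range must be sorted.
    template <class I, class P>
    I remove_unique_cxx11(I first, I last, P pred) {
        typedef typename std::iterator_traits<I>::reference ref;
        I dest = first;
        // Find the start of the next run of equal neighbours.
        while ((first = std::adjacent_find(first, last, pred)) != last) {
            *dest = *first;   // keep one copy of the repeated value
            ++first;
            ++dest;
            // Find the first adjacent pair that differs: the last element of the run.
            first = std::adjacent_find(first, last,
                [&pred](ref a, ref b) { return !pred(a, b); });
            if (first == last) break;
            ++first;          // step onto the first element after the run
        }
        return dest;
    }

    template <class I>
    I remove_unique_cxx11(I first, I last) {
        return remove_unique_cxx11(first, last,
            std::equal_to<typename std::iterator_traits<I>::value_type>());
    }
    ```

    As with std::unique, the caller is expected to erase the tail, e.g. v.erase(remove_unique_cxx11(v.begin(), v.end()), v.end()).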
    
  • 2021-01-04 06:49

    You can even use mismatch, for extra points!
    Btw: nice exercise.

    #include <algorithm>  // std::mismatch
    
    /** Copies one of each duplicated value to the front, returning the end
     *  of that range. Use a sorted range as input. */
    template<class TIter>
    TIter Duplicates(TIter begin, TIter end) {
        TIter dup = begin;
        for (TIter it = begin; it != end; ++it) {
            TIter next = it;
            ++next;
            TIter const miss = std::mismatch(next, end, it).second;
            if (miss != it) {
                *dup++ = *miss;
                it = miss;
            }
        }
        return dup;
    }
    
  • 2021-01-04 06:52

    My suggestion would be a modified insertion sort, so that you can sort & filter dupes at the same time.
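
    One possible reading of this suggestion, offered as an assumption rather than as the answerer's code: keep a sorted vector of the values seen so far, insert each input element at its sorted position, and record a value as a duplicate when it is already present. The function name duplicates_by_insertion is hypothetical.

    ```cpp
    #include <algorithm>
    #include <vector>

    // Insertion-sort-style sketch: the input does not need to be sorted.
    std::vector<int> duplicates_by_insertion(const std::vector<int>& input) {
        std::vector<int> seen;   // sorted, distinct values encountered so far
        std::vector<int> dups;   // sorted, distinct values encountered at least twice
        for (int x : input) {
            auto pos = std::lower_bound(seen.begin(), seen.end(), x);
            if (pos == seen.end() || *pos != x) {
                seen.insert(pos, x);          // first occurrence
            } else {
                auto dpos = std::lower_bound(dups.begin(), dups.end(), x);
                if (dpos == dups.end() || *dpos != x) {
                    dups.insert(dpos, x);     // second occurrence
                }                             // later occurrences are ignored
            }
        }
        return dups;
    }
    ```

    Because each vector insertion may shift elements, this is quadratic in the worst case, like insertion sort itself, so it is likely only attractive for small inputs.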

  • 2021-01-04 06:58

    This is in the style of the standard library. Credit for algorithm goes to James! (If you +1 me, you better +1 him, or else). All I did was make it standard library style:

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <iterator>
    #include <vector>
    
    // other stuff (not for you)
    template <typename T>
    void print(const char* pMsg, const T& pContainer)
    {
        std::cout << pMsg << "\n    ";
        std::copy(pContainer.begin(), pContainer.end(),
            std::ostream_iterator<typename T::value_type>(std::cout, " "));
        std::cout << std::endl;
    }
    
    template <typename T, size_t N>
    T* endof(T (&pArray)[N])
    {
        return &pArray[0] + N;
    }
    
    // not_unique functions (for you)
    template <typename ForwardIterator, typename BinaryPredicate>
    ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast,
                               BinaryPredicate pPred)
    {
        // correctly handle case where an empty range was given:
        if (pFirst == pLast) 
        { 
            return pLast; 
        }
    
        ForwardIterator result = pFirst;
        ForwardIterator previous = pFirst;
        // has the current run of equal elements already been copied?
        bool emitted = false;
    
        for (++pFirst; pFirst != pLast; ++pFirst, ++previous)
        {
            // if equal to previous
            if (pPred(*pFirst, *previous))
            {
                if (!emitted)
                {
                    // first repeat in this run: keep one copy
                    if (result != previous)
                    {
                        *result = *previous;
                    }
    
                    // bump
                    ++result;
                    emitted = true;
                }
            }
            else
            {
                // a new run starts at pFirst
                emitted = false;
            }
        }
    
        return result;
    }
    
    template <typename ForwardIterator>
    ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast)
    {
        return not_unique(pFirst, pLast,
                    std::equal_to<typename std::iterator_traits<ForwardIterator>::value_type>());
    }
    
    
    //test
    int main()
    {
        typedef std::vector<int> vec;
    
        int data[] = {1, 4, 7, 7, 2, 2, 2, 3, 9, 9, 5, 4, 2, 8};
        vec v(data, endof(data));
    
        // precondition
        std::sort(v.begin(), v.end());
        print("before", v);
    
        // duplicatify (it's a word now)
        vec::iterator iter = not_unique(v.begin(), v.end());
        print("after", v);
    
        // remove extra
        v.erase(iter, v.end());
        print("erased", v);
    }
    
  • 2021-01-04 06:58

    From a big-O standpoint, I think your implementation is as good as it gets. The dominant cost is the sort, which is O(N log N). One possibility, though, would be to build a new vector containing the duplicate entries instead of erasing the non-duplicates from the existing vector. However, this would only be better if the number of distinct duplicates is small relative to the total number of entries.

    Consider the extreme example. If the original array consisted of 1,000 entries with only one duplicate, the output would be a vector with just one value. It might be a bit more efficient to create a new vector with one entry than to erase the other 999 entries from the original vector. I suspect, however, that in real-world testing the savings would be difficult to measure.
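
    A minimal sketch of the "build a new vector" variant, assuming int elements; the helper name copy_duplicates is hypothetical. It sorts a copy of the input and appends one copy of each repeated value to a fresh vector instead of erasing from the original:

    ```cpp
    #include <algorithm>
    #include <vector>

    // Takes the input by value, so the caller's vector is left untouched.
    std::vector<int> copy_duplicates(std::vector<int> values) {
        std::sort(values.begin(), values.end());
        std::vector<int> result;
        auto it = values.begin();
        // adjacent_find returns the first element of the next equal pair.
        while ((it = std::adjacent_find(it, values.end())) != values.end()) {
            result.push_back(*it);
            // Skip the rest of this run of equal values.
            it = std::upper_bound(it, values.end(), *it);
        }
        return result;
    }
    ```

    The sort still dominates at O(N log N); whether this beats erasing in place would have to be measured.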

    Edit: I was just thinking about this as an "interview" question, so this is not a terribly useful answer. But it would be possible to solve this in O(N) (linear time) as opposed to O(N log N), by spending storage space instead of CPU time.

    Create two "bit" arrays, both initially cleared. Loop through your vector of integer values and look up each value in the first bit array. If its bit is not set, set it (to 1). If it is already set, set the corresponding bit in the second array (indicating a duplicate). After all vector entries are processed, scan through the second array and output the integers whose bits are set; those are the duplicates.

    The reason for using bit arrays is just space efficiency. For 4-byte integers, the raw space required is 2 * 2^32 / 8 bytes (1 GiB). But this could be turned into a usable algorithm by making the arrays sparse. The very pseudo pseudo-code would be something like this:

    bitarray1[infinite_size];
    bitarray2[infinite_size];
    
    clear/zero bitarrays
    
    // NOTE - do not need to sort the input
    foreach value in original vector {
        if ( bitarray1[value] ) 
            // duplicate
            bitarray2[value] = 1
        bitarray1[value] = 1
    }
    
    // At this point, bitarray2 contains a 1 for all duplicate values.
    // Scan it and create the new vector with the answer
    for i = 0 to maxvalue
        if ( bitarray2[i] )
            print/save/keep i
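
    The pseudo-code above could be sketched in C++ roughly as follows, using std::vector<bool> as a dense bit array rather than a sparse one. It assumes every value lies in [0, max_value]; both the function name duplicates_by_bitmap and the max_value parameter are hypothetical.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Precondition (not checked): 0 <= x <= max_value for every x in input.
    std::vector<int> duplicates_by_bitmap(const std::vector<int>& input, int max_value) {
        const std::size_t size = static_cast<std::size_t>(max_value) + 1;
        std::vector<bool> seen(size, false);   // bitarray1: value occurred at least once
        std::vector<bool> dup(size, false);    // bitarray2: value occurred at least twice
        for (int x : input) {                  // the input does not need to be sorted
            const std::size_t i = static_cast<std::size_t>(x);
            if (seen[i]) dup[i] = true;
            seen[i] = true;
        }
        std::vector<int> result;
        for (std::size_t i = 0; i < size; ++i) {   // ascending scan yields sorted output
            if (dup[i]) result.push_back(static_cast<int>(i));
        }
        return result;
    }
    ```

    The running time is proportional to the input length plus max_value, so this only pays off when the value range is not much larger than the input.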
    