Algorithm: efficient way to remove duplicate integers from an array

离开以前 2020-11-22 16:03

I got this problem from an interview with Microsoft.

Given an array of random integers, write an algorithm in C that removes duplicated numbers an

30 answers
  • 2020-11-22 16:28

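    Insert the elements into a binary search tree, skipping any value that is already present. Building the tree takes O(n log n) time if it stays balanced (not O(n)) and needs O(n) extra memory.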
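A minimal sketch of the tree-based idea in C, under stated assumptions: the names `struct node`, `bst_insert`, `bst_free` and `dedupe_bst` are hypothetical, and the tree is deliberately unbalanced to keep the example short, so already-sorted input degrades it to O(n^2). A self-balancing tree would be needed for a guaranteed O(n log n).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type for a plain (unbalanced) binary search tree. */
struct node {
    int value;
    struct node *left, *right;
};

/* Insert value into the tree. Returns true if the value was new.
   Returns false if it was already present or if malloc failed;
   in the latter case *failed is set to true. */
static bool bst_insert(struct node **root, int value, bool *failed)
{
    while (*root != NULL) {
        if (value == (*root)->value)
            return false;                       /* duplicate */
        root = value < (*root)->value ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL) {
        *failed = true;
        return false;
    }
    (*root)->value = value;
    (*root)->left = (*root)->right = NULL;
    return true;
}

/* Free the whole tree. Recursion depth equals the tree height. */
static void bst_free(struct node *n)
{
    if (n == NULL)
        return;
    bst_free(n->left);
    bst_free(n->right);
    free(n);
}

/* Keep the first occurrence of every value, preserving order.
   Returns the new length, or (size_t)-1 if memory ran out
   (the array is then left partially compacted). */
size_t dedupe_bst(int *a, size_t n)
{
    struct node *root = NULL;
    bool failed = false;
    size_t out = 0;                             /* next write position */

    for (size_t i = 0; i < n && !failed; i++)
        if (bst_insert(&root, a[i], &failed))
            a[out++] = a[i];

    bst_free(root);
    return failed ? (size_t)-1 : out;
}
```

Calling `dedupe_bst(a, n)` compacts the unique values to the front of `a` and returns how many there are; the caller treats everything past that length as garbage.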

  • 2020-11-22 16:30

    An array should obviously be "traversed" right-to-left to avoid unnecessary copying of values back and forth.

    If you have unlimited memory, you can allocate a bit array of 2^(8 * sizeof(element type)) / 8 bytes, one bit per possible value, where each bit records whether the corresponding value has already been encountered. For a 32-bit integer that is 512 MiB.

    If you don't, I can't think of anything better than traversing the array, comparing each value with the values that follow it, and removing every duplicate that is found. This is O(n^2) (more precisely, about (n^2-n)/2 comparisons).

    IBM has an article on a closely related subject.

  • 2020-11-22 16:30

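    Use a Bloom filter for hashing. This reduces the memory overhead significantly, but a Bloom filter can report false positives, so a "seen before" answer has to be confirmed before an element is discarded.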
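One way the Bloom filter suggestion could look in C, as a sketch only. The filter size `BLOOM_BITS`, the mixing function `mix`, its constants and seeds, and the name `dedupe_bloom` are all assumptions made for illustration. Because a Bloom filter can claim to have seen a value it never saw, every hit is confirmed by scanning the values kept so far; only a miss is trusted outright. The result is therefore always correct, but the worst case is still quadratic once the filter fills up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 1024u   /* assumed filter size in bits */

/* Hypothetical integer mixer; any reasonable hash would do here. */
static uint32_t mix(uint32_t x, uint32_t seed)
{
    x ^= seed;
    x *= 0x9E3779B1u;
    x ^= x >> 15;
    x *= 0x85EBCA77u;
    x ^= x >> 13;
    return x;
}

/* Keep the first occurrence of every value, preserving order.
   Returns the new length. */
size_t dedupe_bloom(int *a, size_t n)
{
    uint8_t bits[BLOOM_BITS / 8];
    size_t out = 0;                 /* a[0..out) holds the values kept so far */

    memset(bits, 0, sizeof bits);

    for (size_t i = 0; i < n; i++) {
        /* Two bit positions per value (k = 2 hash functions). */
        uint32_t h1 = mix((uint32_t)a[i], 0x12345678u) % BLOOM_BITS;
        uint32_t h2 = mix((uint32_t)a[i], 0x9ABCDEF0u) % BLOOM_BITS;

        bool maybe_seen = ((bits[h1 / 8] >> (h1 % 8)) & 1u) &&
                          ((bits[h2 / 8] >> (h2 % 8)) & 1u);
        bool duplicate = false;

        /* A hit may be a false positive, so confirm it against the
           kept prefix. A miss is definitive and needs no scan. */
        if (maybe_seen) {
            for (size_t j = 0; j < out; j++) {
                if (a[j] == a[i]) {
                    duplicate = true;
                    break;
                }
            }
        }

        if (!duplicate) {
            bits[h1 / 8] |= (uint8_t)(1u << (h1 % 8));
            bits[h2 / 8] |= (uint8_t)(1u << (h2 % 8));
            a[out++] = a[i];
        }
    }
    return out;
}
```

The filter only pays off while it is sparsely populated; with far more distinct values than bits, nearly every lookup is a hit and the confirming scan dominates.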

  • 2020-11-22 16:31

    Well, its basic implementation is quite simple. Go through all elements, check whether there are duplicates among the remaining ones, and shift the rest over them.

    It's terribly inefficient, and you could speed it up with a helper array for the output or with sorting/binary trees, but this doesn't seem to be allowed.
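The basic approach described in this answer can be sketched in C roughly as follows. The function name `dedupe_shift` is hypothetical; the sketch uses no helper array and returns the new logical length, leaving the tail of the array unspecified.

```c
#include <stddef.h>
#include <string.h>

/* For each element, look for equal values later in the array and
   close the gap by shifting the remainder one slot to the left.
   O(n^2) comparisons, plus up to O(n) moves per duplicate.
   Returns the new length; order of first occurrences is preserved. */
size_t dedupe_shift(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t j = i + 1;
        while (j < n) {
            if (a[j] == a[i]) {
                /* Overwrite a[j] with everything that follows it. */
                memmove(&a[j], &a[j + 1], (n - j - 1) * sizeof a[0]);
                n--;            /* array is one element shorter now */
                /* do not advance j: a[j] holds a new, unchecked value */
            } else {
                j++;
            }
        }
    }
    return n;
}
```

For example, `{4, 2, 4, 7, 2, 9, 7}` becomes `{4, 2, 7, 9}` with a returned length of 4.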

  • 2020-11-22 16:35

    I've posted this once before on SO, but I'll reproduce it here because it's pretty cool. It uses hashing, building something like a hash set in place. It's guaranteed to use O(1) auxiliary space (the recursion is a tail call) and typically runs in O(N) time. The algorithm is as follows:

    1. Take the first element of the array; this will be the sentinel.
    2. Reorder the rest of the array, as far as possible, so that each element sits at the position corresponding to its hash. Duplicates are discovered during this step; set them equal to the sentinel.
    3. Move all elements whose index equals their hash to the beginning of the array.
    4. Move all elements that are equal to the sentinel, except the first element of the array, to the end of the array.
    5. What is left between the properly hashed elements and the duplicate elements are the elements that could not be placed at the index corresponding to their hash because of a collision. Recurse to deal with these elements.

    This can be shown to be O(N) provided the hashing hits no pathological scenario: even if there are no duplicates, approximately 2/3 of the elements are eliminated at each level of recursion. Each level of recursion is O(n), where small n is the number of elements left. The only problem is that, in practice, it's slower than a quick sort when there are few duplicates, i.e. lots of collisions. However, when there are huge numbers of duplicates, it's amazingly fast.

    Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

    Here's an implementation in the D programming language:

    import std.algorithm : swap;  // swap() is used throughout the implementation
    
    void uniqueInPlace(T)(ref T[] dataIn) {
        uniqueInPlaceImpl(dataIn, 0);
    }
    
    void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
        if(dataIn.length - start < 2)
            return;
    
        immutable T sentinel = dataIn[start];  // `invariant` in the D version this was written for
        T[] data = dataIn[start + 1..$];
    
        static hash_t getHash(T elem) {
            static if(is(T == uint) || is(T == int)) {
                return cast(hash_t) elem;
            } else static if(__traits(compiles, elem.toHash)) {
                return elem.toHash;
            } else {
                static auto ti = typeid(typeof(elem));
                return ti.getHash(&elem);
            }
        }
    
        for(size_t index = 0; index < data.length;) {
            if(data[index] == sentinel) {
                index++;
                continue;
            }
    
            auto hash = getHash(data[index]) % data.length;
            if(index == hash) {
                index++;
                continue;
            }
    
            if(data[index] == data[hash]) {
                data[index] = sentinel;
                index++;
                continue;
            }
    
            if(data[hash] == sentinel) {
                swap(data[hash], data[index]);
                index++;
                continue;
            }
    
            auto hashHash = getHash(data[hash]) % data.length;
            if(hashHash != hash) {
                swap(data[index], data[hash]);
                if(hash < index)
                    index++;
            } else {
                index++;
            }
        }
    
    
        size_t swapPos = 0;
        foreach(i; 0..data.length) {
            if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
                swap(data[i], data[swapPos++]);
            }
        }
    
        size_t sentinelPos = data.length;
        for(size_t i = swapPos; i < sentinelPos;) {
            if(data[i] == sentinel) {
                swap(data[i], data[--sentinelPos]);
            } else {
                i++;
            }
        }
    
        dataIn = dataIn[0..sentinelPos + start + 1];
        uniqueInPlaceImpl(dataIn, start + swapPos + 1);
    }
    
  • 2020-11-22 16:35

    Let's see:

    • O(N) pass to find min/max
    • allocate bit-array for found
    • O(N) pass swapping duplicates to end.
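The three steps listed above might be sketched in C like this. The name `dedupe_bitmap` and the error convention are assumptions. The bit array covers only the range between the smallest and largest value, so the approach is practical only when that range is small enough to allocate; the order of the surviving values is not guaranteed to be preserved.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Moves duplicates to the end of the array and returns the number of
   unique values now at the front, or (size_t)-1 if the bit array
   could not be allocated. */
size_t dedupe_bitmap(int *a, size_t n)
{
    if (n < 2)
        return n;

    /* Pass 1: find min and max. */
    int lo = a[0], hi = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] < lo) lo = a[i];
        if (a[i] > hi) hi = a[i];
    }

    /* One bit per value in [lo, hi]. Unsigned arithmetic avoids
       signed overflow when the range spans most of int. */
    uint64_t range = (uint64_t)((unsigned)hi - (unsigned)lo) + 1;
    size_t bytes = (size_t)((range + 7) / 8);
    unsigned char *seen = calloc(bytes, 1);
    if (seen == NULL)
        return (size_t)-1;

    /* Pass 2: a[0..i) is unique, a[end..n) holds duplicates. */
    size_t i = 0, end = n;
    while (i < end) {
        unsigned off = (unsigned)a[i] - (unsigned)lo;
        unsigned char mask = (unsigned char)(1u << (off % 8));
        if (seen[off / 8] & mask) {
            /* Duplicate: swap it with the last unexamined element.
               a[i] is now unexamined, so i is not advanced. */
            int tmp = a[i];
            a[i] = a[--end];
            a[end] = tmp;
        } else {
            seen[off / 8] |= mask;
            i++;
        }
    }

    free(seen);
    return end;
}
```

Swapping rather than shifting keeps the second pass linear, at the cost of reordering elements near the end of the array.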