Find a number that appears exactly N/2 times

Backend · Unresolved · 20 · 1861

旧巷少年郎 2021-01-29 23:17

Here is one of my interview questions: given an array of N elements where one element appears exactly N/2 times and the remaining N/2 elements are unique, find the repeated element.

20 answers
  • 2021-01-29 23:47

    There is a constant-time solution if you are ready to accept a small probability of error. Randomly sample two values from the array; if they are the same, you have found the value you were looking for. At each step you have roughly a 0.75 probability of not finishing, and because for every epsilon there exists an n such that (3/4)^n < eps, we can sample at most n times and return an error if we did not find a matching pair.

    Also note that if we keep sampling until we find a pair, the expected running time is constant, but the worst-case running time is unbounded.
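
    The bounded-sampling idea can be sketched in Python like this (a sketch, not the answerer's code: the function name and the `eps` parameter are my own):

```python
import math
import random

def sample_repeated(arr, eps=1e-6):
    """Randomized search for the element that fills half the array.

    Each round draws two random positions; for large arrays they hold
    equal values with probability about 1/4, so after
    ceil(log(eps) / log(3/4)) rounds the chance of still not having
    seen a match is below eps.
    """
    rounds = math.ceil(math.log(eps) / math.log(3 / 4))
    for _ in range(rounds):
        i = random.randrange(len(arr))
        j = random.randrange(len(arr))
        if i != j and arr[i] == arr[j]:
            return arr[i]
    return None  # give up: probability below eps when the input is as specified
```

    Capping the number of rounds trades the unbounded worst case for a small, tunable error probability.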

  • 2021-01-29 23:51

    Restating my solution from a comment to Ganesh's version so I can format it:

    for (i = 0; i < N-2; i += 3) {
       if (a[i] == a[i+1] || a[i] == a[i+2]) return a[i];
       if (a[i+1] == a[i+2]) return a[i+1];
    }
    return a[N-1]; // for very small N
    

    Probability of winning after 1 iteration: 50%

    Probability of winning after 2 iterations: 75%

    Etc.

    Worst case, O(n) time O(1) space.

    Note that after N/4 iterations you've used up all the N/2 unique numbers, so this loop will never iterate through more than 3/4 of the array if it is as specified.
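
    A runnable Python port of the loop above (a sketch; the function name is mine):

```python
def find_repeated(a):
    """Scan the array three elements at a time. Only the repeated value
    can occur twice, since all other values are unique, so the first
    triple containing two equal values reveals the answer."""
    n = len(a)
    for i in range(0, n - 2, 3):
        if a[i] == a[i + 1] or a[i] == a[i + 2]:
            return a[i]
        if a[i + 1] == a[i + 2]:
            return a[i + 1]
    return a[n - 1]  # for very small N
```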

  • 2021-01-29 23:52

    Suppose you have a Python algorithm like this:

    import random
    
    def find_duplicate(arr, gap):
        cost, reps = 0, 0
        while True:
            indexes = sorted((random.randint(0, len(arr)-i-1) for i in range(gap)), reverse=True)
            selection = [arr.pop(i) for i in indexes]
            selection_set = set(selection)
            cost += len(selection)
            reps += 1
            if len(selection) > len(selection_set):
                return cost, reps
    

    The idea is that arr is your set of values and gap is the log base-2 of the size. Each time you select gap elements and see if there are duplicated values. If so, return your cost (in count of elements examined) and the number of iterations (where you examine log2(size) elements per iteration). Otherwise, look at another gap-sized set.

    The problem with benchmarking this algorithm is that the creation of the data each time through the loop and alteration of the data is expensive, assuming a large amount of data. (Initially, I was doing 1 000 000 elements with 10 000 000 iterations.)

    So let's reduce to an equivalent problem. The data is passed in as n/2 unique elements and n/2 repeated elements. The algorithm picks the random indexes of log2(n) elements and checks for duplicates. Now we don't even have to create the data and to remove elements examined: we can just check if we have two or more indexes over the halfway point. Select gap indexes, check for 2 or more over the halfway point: return if found, otherwise repeat.

    import math
    import random
    
    def find_duplicate(total, half, gap):
        cost, reps = 0, 0
        while True:
            indexes = [random.randint(0,total-i-1) for i in range(gap)]
            cost += gap
            reps += 1
            above_half = [i for i in indexes if i >= half]
            if len(above_half) >= 2:
                return cost, reps
            else:
                total -= len(indexes)
                half -= (len(indexes) - len(above_half))
    

    Now drive the code like this:

    if __name__ == '__main__':
        import collections
        import datetime
        for total in [2**i for i in range(5, 21)]:
            half = total // 2
            gap = int(math.ceil(math.log10(total) / math.log10(2)))
            d = collections.defaultdict(int)
            total_cost, total_reps = 0, 1000*1000*10
            s = datetime.datetime.now()
            for _ in range(total_reps):
                cost, reps = find_duplicate(total, half, gap)
                d[reps] += 1
                total_cost += cost
            e = datetime.datetime.now()
            print("Elapsed: ", (e - s))
            print("%d elements" % total)
            print("block size %d (log of # elements)" % gap)
            for k in sorted(d.keys()):
                print(k, d[k])
            average_cost = float(total_cost) / float(total_reps)
            average_logs = average_cost / gap
            print("Total cost: ", total_cost)
            print("Average cost in accesses: %f" % average_cost)
            print("Average cost in logs: %f" % average_logs)
            print()
    

    If you try this test, you'll find that the number of times the algorithm has to do multiple selections declines with the number of elements in the data. That is, your average cost in logs asymptotically approaches 1.

    elements    accesses    log-accesses
    32          6.362279    1.272456
    64          6.858437    1.143073
    128         7.524225    1.074889
    256         8.317139    1.039642
    512         9.189112    1.021012
    1024        10.112867   1.011287
    2048        11.066819   1.006075
    4096        12.038827   1.003236
    8192        13.022343   1.001719
    16384       14.013163   1.000940
    32768       15.007320   1.000488
    65536       16.004213   1.000263
    131072      17.002441   1.000144
    262144      18.001348   1.000075
    524288      19.000775   1.000041
    1048576     20.000428   1.000021
    

    Now is this an argument for the ideal algorithm being log2(n) in the average case? Perhaps. It certainly is not so in the worst case.

    Also, you don't have to pick log2(n) elements at once. You can pick 2 and check for equality (though in the degenerate case you may never find the duplicate), or pick any larger number and check for duplicates. At this point, all the algorithms that select elements and check for duplication are identical, varying only in how many they pick and how they identify duplication.
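
    That family of algorithms can be captured in one hypothetical helper, parameterized by how many elements are sampled per round (the name and the parameter `k` are mine):

```python
import random

def sample_k_until_duplicate(arr, k):
    """Repeatedly sample k distinct positions; stop when the sampled
    values contain a duplicate. Only the repeated element can collide,
    since all other values are unique. Returns (value, accesses)."""
    accesses = 0
    while True:
        picks = [arr[i] for i in random.sample(range(len(arr)), k)]
        accesses += k
        seen = set()
        for v in picks:
            if v in seen:
                return v, accesses
            seen.add(v)
```

    k = 2 gives the pair-sampling answer above; k = log2(n) gives the benchmarked variant.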

  • 2021-01-29 23:53

    Algorithm RepeatedElement(a, n)

    while (true) do
    {
       i := Random() mod n + 1; j := Random() mod n + 1;
       // i and j are random numbers in the range [1, n]
       if ((i ≠ j) and (a[i] = a[j])) then return a[i];
    }
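
    A direct Python translation (a sketch; 0-based indexing replaces the pseudocode's [1, n] range):

```python
import random

def repeated_element(a):
    """Keep picking two random positions until they differ and their
    values match; that value is the repeated element."""
    n = len(a)
    while True:
        i, j = random.randrange(n), random.randrange(n)
        if i != j and a[i] == a[j]:
            return a[i]
```

    As the first answer notes, the expected running time is constant, but the loop has no worst-case bound.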
    
  • 2021-01-29 23:54

    It's fairly simple to see that no O(log n) algorithm exists. Clearly you have to look at array elements to figure out which one is repeated, but no matter what order you choose to examine them, the first floor(n/2) elements you look at might all be unique. You could simply be unlucky. If that happened, you would have no way of knowing which element was the repeated one. Since every algorithm can be forced to make more than floor(n/2) array references on some run, there is definitely no sub-linear algorithm.
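
    The counting argument can be made concrete with a small adversary sketch (hypothetical helper; `probe_order` stands for whatever fixed order an algorithm reads positions in):

```python
def adversarial_input(n, probe_order):
    """Given any fixed order in which an algorithm probes positions,
    place distinct values at the first n//2 probed positions, so those
    probes reveal nothing; the remaining n - n//2 slots all hold the
    repeated value."""
    a = [None] * n
    uniques = iter(range(1, n // 2 + 1))  # n//2 distinct values
    for idx in probe_order[: n // 2]:
        a[idx] = next(uniques)
    for idx in range(n):
        if a[idx] is None:
            a[idx] = 0  # 0 is the repeated value
    return a
```

    Whatever the algorithm does, its first n//2 reads on this input see only unique values, so it cannot yet name the repeated element.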

  • 2021-01-29 23:57

    If you are told that the element you are looking for is the non-unique one, surely the quickest way is to iterate along the array until you find two values the same, then return that element and stop looking. At most you have to search a little over half the array, since the first floor(n/2) + 2 elements must contain the repeated value twice.

    I think this is O(n) so I guess it doesn't really help.

    It seems too simple so I think I don't understand the problem correctly.
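
    One way to read "iterate until you find two the same" is a scan with a set of values seen so far (a sketch; the helper name is mine):

```python
def first_duplicate(arr):
    """Scan left to right, remembering values seen so far; the first
    value seen twice must be the repeated element."""
    seen = set()
    for v in arr:
        if v in seen:
            return v
        seen.add(v)
    return None  # cannot happen under the problem's precondition
```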
