Find the number that appears exactly N/2 times

Backend · Unresolved · 20 replies · 1865 views
旧巷少年郎 2021-01-29 23:17

Here is one of my interview questions. Given an array of N elements where one element appears exactly N/2 times and the remaining N/2 elements are unique, find that element.
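For reference, a brute-force baseline (a sketch of mine, not part of the question) simply counts occurrences and returns the value seen n/2 times:

```python
from collections import Counter

def majority_half(arr):
    # O(n) time, O(n) space: the repeated element is the most common one,
    # appearing exactly len(arr) // 2 times.
    value, count = Counter(arr).most_common(1)[0]
    assert count == len(arr) // 2
    return value

print(majority_half([7, 1, 7, 2, 7, 3, 7, 4]))  # prints 7
```

The answers below look for something cheaper than a full scan.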

20 replies
  •  -上瘾入骨i
    2021-01-29 23:52

    Suppose you have a Python algorithm like this:

    import random
    
    def find_duplicate(arr, gap):
        cost, reps = 0, 0
        while True:
            # Draw `gap` indexes (index i drawn from the range that will remain
            # after i pops); popping in descending order keeps the later
            # indexes valid as the list shrinks.
            indexes = sorted((random.randint(0, len(arr) - i - 1) for i in range(gap)), reverse=True)
            selection = [arr.pop(i) for i in indexes]
            selection_set = set(selection)
            cost += len(selection)
            reps += 1
            if len(selection) > len(selection_set):
                return cost, reps
    

    The idea is that arr is your set of values and gap is the log base-2 of the size. Each time you select gap elements and see if there are duplicated values. If so, return your cost (in count of elements examined) and the number of iterations (where you examine log2(size) elements per iteration). Otherwise, look at another gap-sized set.

    The problem with benchmarking this algorithm is that creating the data on each pass through the loop, and mutating it, is expensive, assuming a large amount of data. (Initially, I was doing 1 000 000 elements with 10 000 000 iterations.)

    So let's reduce to an equivalent problem. The data is passed in as n/2 unique elements and n/2 repeated elements. The algorithm picks the random indexes of log2(n) elements and checks for duplicates. Now we don't even have to create the data or remove the examined elements: we can just check whether two or more indexes fall over the halfway point. Select gap indexes, check for 2 or more over the halfway point: return if found, otherwise repeat.

    import math
    import random
    
    def find_duplicate(total, half, gap):
        # Indexes in [0, half) stand for the unique elements; indexes in
        # [half, total) stand for the copies of the repeated element.
        cost, reps = 0, 0
        while True:
            indexes = [random.randint(0, total - i - 1) for i in range(gap)]
            cost += gap
            reps += 1
            above_half = [i for i in indexes if i >= half]
            if len(above_half) >= 2:
                # Two copies of the repeated element: duplicate found.
                return cost, reps
            else:
                # Discard the examined elements and try again.
                total -= len(indexes)
                half -= (len(indexes) - len(above_half))
    

    Now drive the code like this:

    if __name__ == '__main__':
        import collections
        import datetime
        for total in [2**i for i in range(5, 21)]:
            half = total // 2
            gap = int(math.ceil(math.log2(total)))
            d = collections.defaultdict(int)
            total_cost, total_reps = 0, 1000*1000*10
            s = datetime.datetime.now()
            for _ in range(total_reps):
                cost, reps = find_duplicate(total, half, gap)
                d[reps] += 1
                total_cost += cost
            e = datetime.datetime.now()
            print("Elapsed: ", (e - s))
            print("%d elements" % total)
            print("block size %d (log of # elements)" % gap)
            for k in sorted(d.keys()):
                print(k, d[k])
            average_cost = float(total_cost) / float(total_reps)
            average_logs = average_cost / gap
            print("Total cost: ", total_cost)
            print("Average cost in accesses: %f" % average_cost)
            print("Average cost in logs: %f" % average_logs)
            print()
    

    If you try this test, you'll find that the number of times the algorithm has to do multiple selections declines with the number of elements in the data. That is, your average cost in logs asymptotically approaches 1.

    elements    accesses    log-accesses
    32          6.362279    1.272456
    64          6.858437    1.143073
    128         7.524225    1.074889
    256         8.317139    1.039642
    512         9.189112    1.021012
    1024        10.112867   1.011287
    2048        11.066819   1.006075
    4096        12.038827   1.003236
    8192        13.022343   1.001719
    16384       14.013163   1.000940
    32768       15.007320   1.000488
    65536       16.004213   1.000263
    131072      17.002441   1.000144
    262144      18.001348   1.000075
    524288      19.000775   1.000041
    1048576     20.000428   1.000021
    

    Now is this an argument for the ideal algorithm being log2(n) in the average case? Perhaps. It certainly is not so in the worst case.
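A back-of-the-envelope check of that "perhaps" (my approximation, ignoring duplicate indexes and the shrinking array): each sampled index lands in the repeated half with probability about 1/2, so a round of g = log2(n) samples fails only when 0 or 1 of them land there, i.e. with probability (g+1)/2^g. Treating rounds as independent geometric trials, the expected cost in logs is 1/(1 - (g+1)/2^g), which tends to 1 and tracks the measured column reasonably well:

```python
def expected_logs(g):
    # One round fails when at most one of the g sampled indexes falls in
    # the repeated half: probability (g + 1) / 2**g under a fair-coin model.
    p_fail = (g + 1) / 2**g
    # Geometric expectation of the number of rounds, in units of one round.
    return 1 / (1 - p_fail)

for g in [5, 10, 15, 20]:
    print(g, expected_logs(g))
```

For g = 5 this gives about 1.23 (measured: 1.27) and for g = 20 about 1.00002 (measured: 1.000021).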

    Also, you don't have to pick log2(n) elements at once. You can pick 2 and check for equality (though in the degenerate case this may take arbitrarily many rounds before it finds the duplicate), or pick any other number and check it for duplicates. At this point, all the algorithms that select elements and check for duplication are identical, varying only in how many they pick and how they identify duplication.
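The pick-2 variant can be simulated directly under the same index model as the benchmark above (a sketch of mine: indexes at or above the halfway point stand for copies of the repeated element). Both of two random indexes land in the repeated half with probability about 1/4, so the expected number of rounds is about 4 and the expected cost about 8 accesses, independent of n:

```python
import random

def find_duplicate_pairs(total, half):
    # Repeatedly draw two random indexes; two indexes in [half, total)
    # both hit copies of the repeated element, i.e. a duplicate pair.
    cost = 0
    while True:
        i, j = random.randrange(total), random.randrange(total)
        cost += 2
        if i >= half and j >= half:
            return cost

costs = [find_duplicate_pairs(1 << 20, 1 << 19) for _ in range(10000)]
print(sum(costs) / len(costs))  # averages around 8
```

Constant expected cost, but with a heavy tail: unlike the log2(n)-block version, a single unlucky run can take many rounds.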
