Easy interview question got harder: given numbers 1..100, find the missing number(s) given exactly k are missing

时光说笑 2020-11-22 07:02

I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing the numbers 1..100, and exactly one of them is missing. How do you find the missing number?

30 answers
  • 2020-11-22 07:46

    I'd take a different approach to that question and probe the interviewer for more details about the larger problem he's trying to solve. Depending on the problem and the requirements surrounding it, the obvious set-based solution might be the right thing and the generate-a-list-and-pick-through-it-afterward approach might not.

    For example, it might be that the interviewer is going to dispatch n messages and needs to know the k that didn't result in a reply and needs to know it in as little wall clock time as possible after the n-kth reply arrives. Let's also say that the message channel's nature is such that even running at full bore, there's enough time to do some processing between messages without having any impact on how long it takes to produce the end result after the last reply arrives. That time can be put to use inserting some identifying facet of each sent message into a set and deleting it as each corresponding reply arrives. Once the last reply has arrived, the only thing to be done is to remove its identifier from the set, which in typical implementations takes O(log k+1). After that, the set contains the list of k missing elements and there's no additional processing to be done.

    This certainly isn't the fastest approach for batch processing pre-generated bags of numbers because the whole thing runs O((log 1 + log 2 + ... + log n) + (log n + log n-1 + ... + log k)). But it does work for any value of k (even if it's not known ahead of time) and in the example above it was applied in a way that minimizes the most critical interval.
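    A minimal Python sketch of that bookkeeping (the send/reply plumbing is omitted and the names are illustrative; note also that Python's built-in set is a hash set, so each removal is O(1) on average rather than the O(log) cost of a tree-based set):

    def find_unanswered(sent_ids, replies):
        # In the real scenario the identifiers are inserted incrementally,
        # in the idle time between messages; here we build the set up front.
        outstanding = set(sent_ids)
        # Discard each identifier as the corresponding reply arrives.
        for reply_id in replies:
            outstanding.discard(reply_id)
        # After the last reply, whatever is left are the k unanswered messages.
        return outstanding

    # Example: ids 1..100 were sent and every reply except 17 and 83 came back.
    replies = [i for i in range(1, 101) if i not in (17, 83)]
    print(sorted(find_unanswered(range(1, 101), replies)))  # [17, 83]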

  • 2020-11-22 07:48

    You will find it in the first couple of pages of Muthukrishnan - Data Stream Algorithms: Puzzle 1: Finding Missing Numbers. It shows exactly the generalization you are looking for. Probably this is what your interviewer read and why he posed these questions.

    Now, if only people would start deleting the answers that are subsumed or superseded by Muthukrishnan's treatment, and make this text easier to find. :)


    Also see sdcvvc's directly related answer, which also includes pseudocode (hurray! no need to read those tricky math formulations :)) (thanks, great work!).

  • 2020-11-22 07:49

    Motivation

    If you want to solve the general-case problem, and you can store and edit the array, then Caf's solution is by far the most efficient. If you can't store the array (streaming version), then sdcvvc's answer is the only type of solution currently suggested.

    The solution I propose is the most efficient answer (so far on this thread) if you can store the problem but can't edit it, and I got the idea from Svalorzen's solution, which solves for 1 or 2 missing items. This solution takes Θ(k*n) time and between Ω(log(k)) and O(k) space, with a possibility that it might actually be O(min(k, log(n))) space. It also works well with parallelism.

    Concept

    The idea is that if you use the original approach of comparing sums:
    int sum = SumOf(1,n) - SumOf(array)

    ... then you take the average of the missing numbers (their sum divided by k, the number of missing values):
    average = sum / k

    ... which provides a boundary: of the missing numbers, there's guaranteed to be at least one number less than or equal to average, and at least one number greater than average. This means that we can split into sub-problems that each scan the array [O(n)] and are only concerned with their respective sub-arrays.

    Code

    C-style solution (don't judge me for the global variables, I'm just trying to make the code readable for non-C folks):

    #include <stdio.h> // compile as C++ (MissingItems takes `average` by reference)
    
    // Example problem:
    const int array [] = {0, 7, 3, 1, 5};
    const int N = 8; // size of original array
    const int array_size = 5;
    
    int SumOneTo (int n)
    {
        return n*(n-1)/2; // sum of 0 .. n-1 (non-inclusive of n)
    }
    
    int MissingItems (const int begin, const int end, int & average)
    {
        // We consider only sub-array where elements, e:
        // begin <= e < end
        
        // Initialise info about missing elements.
        // First assume all are missing:
        int n = end - begin;
        int sum = SumOneTo(end) - SumOneTo(begin);
    
        // Minus everything that we see (ie not missing):
        for (int i = 0; i < array_size; ++i)
        {
            if ((begin <= array[i]) && (array[i] < end))
            {
                n -= 1;
                sum -= array[i];
            }
        }
        
        // used by caller:
        average = sum/n;
        return n;
    }
    
    void Find (const int begin, const int end)
    {
        int average;
    
        if (MissingItems(begin, end, average) == 1)
        {
            printf(" %d", average); // with exactly one number missing, the average is that number
            return;
        }
        
        Find(begin, average + 1); // at least one missing here
        Find(average + 1, end); // at least one here also
    }
    
    int main ()
    {   
        printf("Missing items:");
        
        Find(0, N);
        
        printf("\n");
    }
    

    Analysis

    Ignoring recursion for a moment, each function call clearly takes O(n) time and O(1) space. Note that sum can equal as much as n(n-1)/2, so it requires double the amount of bits needed to store n-1. At most this means that we effectively need two extra elements' worth of space, regardless of the size of the array or k, hence it's still O(1) space under the normal conventions.

    It's not so obvious how many function calls there are for k missing elements, so I'll provide a visual. Your original sub-array (connected array) is the full array, which has all k missing elements in it. We'll imagine them in increasing order, where -- represents a connection (part of the same sub-array):

    m1 -- m2 -- m3 -- m4 -- (...) -- mk-1 -- mk

    The effect of the Find function is to disconnect the missing elements into different non-overlapping sub-arrays. It guarantees that there's at least one missing element in each sub-array, which means breaking exactly one connection.

    What this means is that regardless of how the splits occur, it will always take k-1 Find function calls to do the work of finding the sub-arrays that have only one missing element in them.

    So the time complexity is Θ((k-1 + k) * n) = Θ(k*n).

    For the space complexity, if we divide proportionally each time then we get O(log(k)) space complexity, but if we only separate one at a time it gives us O(k).

    Discussion

    I actually suspect the space complexity is a smaller O(min(k, log(n))), but it's harder to prove. My intuition: where the average performs badly at separation is when there's an outlier, but because of this the separation then removes that outlier. In normal arrays, elements could all be exponentially different, but in this case they're all bounded by n.

  • 2020-11-22 07:50

    You can solve Q2 if you have the sum of both lists and the product of both lists.

    (l1 is the original, l2 is the modified list)

    s = sum(l1) - sum(l2)
    m = mul(l1) / mul(l2)
    

    We can optimise this since the sum of an arithmetic series is n times the average of the first and last terms:

    n = len(l1)
    s = (n/2)*(n+1) - sum(l2)
    

    Now we know that (if a and b are the removed numbers):

    a + b = s
    a * b = m
    

    So we can rearrange to:

    a = s - b
    b * (s - b) = m
    

    And multiply out:

    -b^2 + s*b = m
    

    And rearrange so the right side is zero:

    -b^2 + s*b - m = 0
    

    Then we can solve with the quadratic formula:

    b = (-s + sqrt(s^2 - (4*-1*-m)))/-2
    a = s - b
    

    Sample Python 3 code:

    from functools import reduce
    import operator
    import math
    x = list(range(1,21))
    sx = (len(x)/2)*(len(x)+1)
    x.remove(15)
    x.remove(5)
    mul = lambda l: reduce(operator.mul,l)
    s = sx - sum(x)
    m = mul(range(1,21)) / mul(x)
    b = (-s + math.sqrt(s**2 - (-4*(-m))))/-2
    a = s - b
    print(a,b) #15,5
    

    I do not know the complexity of the sqrt, reduce and sum functions so I cannot work out the complexity of this solution (if anyone does know please comment below.)

  • 2020-11-22 07:51

    Maybe this algorithm can work for question 1:

    1. Precompute the XOR of the first 100 integers (val = 1^2^3^4^...^100)
    2. XOR the elements as they arrive from the input stream (val1 = val1 ^ next_input)
    3. Final answer: val ^ val1

    Or even better:

    def GetValue(A):
        val = 0
        for i in range(1, 101):   # XOR of 1..100
            val ^= i
        for value in A:           # cancel out every number that is present
            val ^= value
        return val                # the XOR of whatever is missing
    

    This algorithm can in fact be expanded for two missing numbers. The first step remains the same. When we call GetValue with two missing numbers, the result will be a1^a2, where a1 and a2 are the two missing numbers. Let's say

    val = a1^a2

    Now to sieve out a1 and a2 from val we take any set bit in val. Let's say the ith bit is set in val. That means that a1 and a2 have different parity at the ith bit position. Now we do another iteration over the original array and keep two XOR values: one for the numbers which have the ith bit set, and the other for the numbers which don't. We now have two buckets of numbers, and it's guaranteed that a1 and a2 lie in different buckets. Now repeat what we did for finding one missing element on each of the buckets.
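    A rough Python sketch of that two-missing-number extension (the function name and the n=100 default are just for illustration):

    def two_missing(arr, n=100):
        # XOR of 1..n folded with the XOR of the array leaves val = a1 ^ a2.
        val = 0
        for i in range(1, n + 1):
            val ^= i
        for x in arr:
            val ^= x
        # Take any set bit of val; a1 and a2 differ at that bit position.
        bit = val & -val
        # Partition 1..n and the array by that bit and XOR each bucket separately;
        # a1 and a2 are guaranteed to end up in different buckets.
        a1 = a2 = 0
        for i in range(1, n + 1):
            if i & bit:
                a1 ^= i
            else:
                a2 ^= i
        for x in arr:
            if x & bit:
                a1 ^= x
            else:
                a2 ^= x
        return a1, a2

    # Example: 1..100 with 15 and 42 removed.
    bag = [i for i in range(1, 101) if i not in (15, 42)]
    print(two_missing(bag))  # (15, 42); which slot each lands in depends on the chosen bit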

  • Can you check if every number exists? If yes, you may try this:

    S = sum of all numbers in the bag (S < 5050)
    Z = sum of the missing numbers = 5050 - S

    if the missing numbers are x and y then:

    x = Z - y and
    max(x) = Z - 1

    So you check the range from 1 to max(x) to find one of the missing numbers; the other is Z minus it.
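    A small Python sketch of that scan, assuming membership can be checked with a set (the names are illustrative):

    def find_two_missing(bag):
        present = set(bag)             # "check if every number exists"
        Z = 5050 - sum(bag)            # sum of the two missing numbers
        for x in range(1, Z):          # max(x) = Z - 1, since y >= 1
            y = Z - x
            if x not in present and y not in present:
                return x, y

    bag = [i for i in range(1, 101) if i not in (23, 77)]
    print(find_two_missing(bag))       # (23, 77)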
