Interview Question: Find Median From Mega Number Of Integers

暖寄归人 2020-12-12 14:02

There is a file that contains 10G (1000000000) integers. Please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with an re

9 answers
  • 2020-12-12 14:22

    My best guess is that a probabilistic median of medians would be the fastest approach. Recipe:

    1. Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
    2. Calculate the median of these integers and assign it to the variable X_new.
    3. If this is not the first iteration, calculate the median of the two medians:

      X_global = (X_global + X_new) / 2

    4. When X_global stops fluctuating much, you have found an approximate median of the data.

    But there are some caveats:

    • The question arises whether an error in the median is acceptable or not.
    • The integers must be randomly and uniformly distributed for the solution to work.

    EDIT: I've played with this algorithm a bit and changed the idea slightly: in each iteration we should add X_new with a decreasing weight, such as:

    X_global = k*X_global + (1.-k)*X_new

    where k is in [0.5, 1.] and increases in each iteration.

    The point is to make the calculation converge quickly to some number in a very small number of iterations, so that a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations! Check this C experiment:

    #include <stdlib.h>
    #include <stdio.h>
    #include <time.h>

    #define ARRAY_SIZE 100000000
    #define RANGE_SIZE 1000

    // Probabilistic "median of medians" experiment.
    // Each chunk is summarized by its mean, and the data is simulated
    // with rand() (uniform in 1..10000), so the printed estimate
    // should settle near 5000.
    int main(void) {
        int iter = 0;
        double X_global = 0.0;  // running estimate
        int i = 0;
        float dk = 0.002f;      // weight increment per iteration
        float k = 0.5f;         // weight of the running estimate, grows to 1
        srand((unsigned) time(NULL));

        while (i < ARRAY_SIZE && k < 1.0f) {
            // Summarize the next chunk of RANGE_SIZE simulated values.
            long sum = 0;
            for (int j = 0; j < RANGE_SIZE; j++) {
                sum += rand() % 10000 + 1;
            }
            double X_new = (double) sum / RANGE_SIZE;

            if (iter > 0) {
                k += dk;
                k = (k > 1.0f) ? 1.0f : k;
                X_global = k * X_global + (1.0 - k) * X_new;
            } else {
                X_global = X_new;
            }

            i += RANGE_SIZE;
            iter++;
            printf("iter %d, median = %.1f\n", iter, X_global);
        }

        return 0;
    }
    

    Oops, it seems I'm talking about the mean, not the median. If so, and you need exactly the median rather than the mean, ignore my post. In any case, mean and median are closely related concepts.
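
    For comparison, here is a minimal Java sketch of the same weighted update applied to true chunk medians rather than chunk means. It is only an illustration: the class name, the method name and the parameters (chunkSize, k, dk) are assumptions, the data is taken from an in-memory array instead of a file, and the result is still only an approximation that relies on the values arriving in random order.

```java
import java.util.Arrays;

final class SmoothedChunkMedian {

    /**
     * Approximates the median by smoothing per-chunk medians with
     * X_global = k*X_global + (1-k)*X_new, where k grows by dk per chunk.
     */
    static double estimate(int[] data, int chunkSize, double k, double dk) {
        if (chunkSize < 1 || data.length < chunkSize) {
            throw new IllegalArgumentException("need at least one full chunk");
        }
        double global = 0.0;
        boolean first = true;
        for (int start = 0; start + chunkSize <= data.length; start += chunkSize) {
            // True median of this chunk (upper middle element when chunkSize is even).
            int[] chunk = Arrays.copyOfRange(data, start, start + chunkSize);
            Arrays.sort(chunk);
            double chunkMedian = chunk[chunkSize / 2];

            if (first) {
                global = chunkMedian;
                first = false;
            } else {
                k = Math.min(1.0, k + dk);
                global = k * global + (1.0 - k) * chunkMedian;
            }
            if (k >= 1.0) {
                break; // later chunks would get zero weight
            }
        }
        return global;
    }
}
```

    With k = 0.5 and dk = 0.002 the loop stops after roughly 250 chunks, so only a small prefix of the data is ever looked at.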

    Good luck.

  • 2020-12-12 14:23
    1. Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
    2. Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.

    The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.

    Given n = number of integers in the original file:

    • Running time: O(n log n)
    • Memory: O(1), adjustable
    • Disk: O(n)
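
    A minimal sketch of the two steps above, assuming the input is a text file with one integer per line. The class name, the chunkSize parameter and the medianOf convenience wrapper are illustrative assumptions. Instead of writing a fully merged file, it merges the sorted runs in a streaming fashion and stops at the middle position:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSortMedian {

    /** One sorted run file plus the value currently at its head. */
    private static final class Run {
        final BufferedReader reader;
        int head;
        Run(Path file) throws IOException { reader = Files.newBufferedReader(file); }
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            head = Integer.parseInt(line);
            return true;
        }
    }

    /** Median of the integers in input; at most chunkSize of them are held in memory. */
    static double median(Path input, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<Run> open = new ArrayList<>();
        try {
            // Phase 1: cut the input into sorted runs on disk, counting as we go.
            long n = 0;
            try (BufferedReader in = Files.newBufferedReader(input)) {
                int[] buf = new int[chunkSize];
                int used = 0;
                for (String line; (line = in.readLine()) != null; n++) {
                    buf[used++] = Integer.parseInt(line.trim());
                    if (used == chunkSize) { runs.add(writeRun(buf, used)); used = 0; }
                }
                if (used > 0) runs.add(writeRun(buf, used));
            }
            if (n == 0) throw new IllegalArgumentException("empty input");

            // Phase 2: k-way merge of the runs, stopping at the middle element(s).
            PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
            for (Path p : runs) {
                Run r = new Run(p);
                open.add(r);
                if (r.advance()) heap.add(r);
            }
            long loIdx = (n - 1) / 2, hiIdx = n / 2; // equal when n is odd
            int lo = 0;
            for (long i = 0; ; i++) {
                Run r = heap.poll();
                int v = r.head;
                if (i == loIdx) lo = v;
                if (i == hiIdx) return ((double) lo + v) / 2.0;
                if (r.advance()) heap.add(r);
            }
        } finally {
            for (Run r : open) r.reader.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }

    private static Path writeRun(int[] buf, int used) throws IOException {
        Arrays.sort(buf, 0, used);
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(run)) {
            for (int i = 0; i < used; i++) { out.write(Integer.toString(buf[i])); out.newLine(); }
        }
        return run;
    }

    /** Convenience wrapper for trying the sketch on a small array. */
    static double medianOf(int[] values, int chunkSize) {
        try {
            Path input = Files.createTempFile("input", ".txt");
            try {
                List<String> lines = new ArrayList<>();
                for (int v : values) lines.add(Integer.toString(v));
                Files.write(input, lines);
                return median(input, chunkSize);
            } finally {
                Files.deleteIfExists(input);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    A real input of this size would produce many runs, so the number of files open at once would have to be capped with intermediate merge passes; that part is left out here.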
  • 2020-12-12 14:26

    You can use the Medians of Medians algorithm.
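
    For reference, a minimal in-memory sketch of median-of-medians selection (groups of five, then a three-way partition around the chosen pivot). The class and method names are assumptions, and it returns the lower median for an even count. Note that it works on an array that fits in RAM; applying it to a file larger than memory would need extra work that is not shown here.

```java
import java.util.Arrays;

final class MedianOfMedians {

    /** Lower median of data; works on a copy so the caller's array is untouched. */
    static int median(int[] data) {
        if (data.length == 0) throw new IllegalArgumentException("empty input");
        int[] a = Arrays.copyOf(data, data.length);
        return select(a, 0, a.length - 1, (a.length - 1) / 2);
    }

    /** Returns the element that would sit at absolute index k if a[lo..hi] were sorted. */
    static int select(int[] a, int lo, int hi, int k) {
        while (true) {
            if (hi - lo < 5) {
                Arrays.sort(a, lo, hi + 1);
                return a[k];
            }
            // Sort each group of five and move its median to the front of the range.
            int store = lo;
            for (int g = lo; g <= hi; g += 5) {
                int end = Math.min(g + 4, hi);
                Arrays.sort(a, g, end + 1);
                swap(a, store++, g + (end - g) / 2);
            }
            // The pivot is the median of the group medians, found recursively.
            int pivot = select(a, lo, store - 1, lo + (store - 1 - lo) / 2);

            // Three-way partition: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
            int lt = lo, i = lo, gt = hi;
            while (i <= gt) {
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            if (k < lt) hi = lt - 1;
            else if (k > gt) lo = gt + 1;
            else return pivot;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```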

  • 2020-12-12 14:30

    Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has a C implementation at the bottom of the document.
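
    As I understand the method, it never modifies or copies the data: it repeatedly guesses a value between the current bounds, counts how many elements fall below and above the guess in one sequential pass, and narrows the bounds. Below is a minimal Java sketch of that idea, not the C code from the linked page. The class name is an assumption, an int array stands in for repeated passes over the file, and it returns the lower median for an even count.

```java
final class TorbenMedian {

    /** Returns the lower median of data using only read-only passes and O(1) extra memory. */
    static int median(int[] data) {
        int n = data.length;
        if (n == 0) throw new IllegalArgumentException("empty input");

        int min = data[0], max = data[0];
        for (int v : data) {
            if (v < min) min = v;
            if (v > max) max = v;
        }

        int half = (n + 1) / 2;
        while (true) {
            // Midpoint computed in long arithmetic to avoid int overflow.
            int guess = (int) (((long) min + max) / 2);
            int less = 0, greater = 0, equal = 0;
            int maxBelow = min;  // largest element smaller than guess
            int minAbove = max;  // smallest element larger than guess
            for (int v : data) { // one sequential pass over the data
                if (v < guess) {
                    less++;
                    if (v > maxBelow) maxBelow = v;
                } else if (v > guess) {
                    greater++;
                    if (v < minAbove) minAbove = v;
                } else {
                    equal++;
                }
            }
            if (less <= half && greater <= half) {
                if (less >= half) return maxBelow;
                if (less + equal >= half) return guess;
                return minAbove;
            }
            // The median lies on the more populated side of the guess.
            if (less > greater) max = maxBelow;
            else min = minAbove;
        }
    }
}
```

    For 32-bit integers the bounds roughly halve on every pass, so a few dozen passes over the file should be enough, with only a handful of counters in memory.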

  • 2020-12-12 14:34

    Here is the algorithm described by @Rex Kerr implemented in Java.

    /**
     * Computes the median.
     * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
     * @return the median (the element of rank ceil((n+1)/2), where n is the array length) as a string
     */
    static String computeMedian(String[] arr) {

        // rank of the median element
        int m = (int) Math.ceil((arr.length + 1) / 2.0);

        // bits of the median decided so far, most significant first
        String bitMask = "";

        while (bitMask.length() < arr[0].length()) {

            // count the elements that match bitMask and have '0' as the next bit
            int zeroBin = 0;
            for (String curr : arr) {
                if (curr.startsWith(bitMask) && curr.charAt(bitMask.length()) == '0') {
                    zeroBin++;
                }
            }

            // decide in which bucket the median is located
            if (zeroBin >= m) {
                bitMask = bitMask.concat("0");
            } else {
                m -= zeroBin;
                bitMask = bitMask.concat("1");
            }
        }

        return bitMask;
    }
    

    Some test cases and updates to the algorithm can be found here.
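
    The same bit-by-bit narrowing can be sketched directly on 32-bit int values, which is closer to what the file in the question would contain. This is only an illustration: the class name is an assumption, an int array stands in for one sequential pass over the file per bit, and flipping the sign bit is my own addition so that negative numbers sort before positive ones.

```java
final class BitwiseIntMedian {

    /** Returns the element of rank ceil((n+1)/2), the same rank as the string version. */
    static int median(int[] data) {
        if (data.length == 0) throw new IllegalArgumentException("empty input");
        long rank = data.length / 2 + 1; // equals ceil((n+1)/2)

        int prefix = 0; // bits of the median decided so far (sign-flipped space)
        int decided = 0; // mask of the bit positions already decided

        for (int bit = 31; bit >= 0; bit--) {
            int bitMask = 1 << bit;

            // One pass: count elements that match the prefix and have a 0 at this bit.
            long zeros = 0;
            for (int v : data) {
                int u = v ^ Integer.MIN_VALUE; // flip sign bit so unsigned order == signed order
                if ((u & decided) == prefix && (u & bitMask) == 0) {
                    zeros++;
                }
            }

            if (zeros < rank) {
                // The median is in the "1" bucket; skip everything in the "0" bucket.
                rank -= zeros;
                prefix |= bitMask;
            }
            decided |= bitMask;
        }
        return prefix ^ Integer.MIN_VALUE; // undo the sign-bit flip
    }
}
```

    This takes 32 passes regardless of the input and needs only a few counters in memory.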

  • 2020-12-12 14:37

    If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.

    If you can't read them into memory, this is what I came up with:

    1. Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.

    2. Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.

    3. Do another pass through, finding the next x largest integers less than x1, the least of which is x2.

    4. I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
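
    A minimal sketch of the passes described above, with a min-heap holding the x largest values seen below the current bound. The class name, method names and the parameter x are assumptions, and an int array stands in for one pass over the file. One detail the steps do not spell out is duplicates: if many values equal the boundary value, taking only values strictly less than it would lose some of them, so each pass here also counts the values at or above the bound.

```java
import java.util.PriorityQueue;

final class MultiPassTopX {

    /** Returns the k-th largest value (k = 1 is the maximum), keeping at most x values in memory. */
    static int kthLargest(int[] data, long k, int x) {
        if (x < 1 || k < 1 || k > data.length) throw new IllegalArgumentException("bad arguments");
        long bound = Long.MAX_VALUE; // only values strictly below this are collected
        while (true) {
            long atOrAbove = 0; // values already accounted for by earlier passes
            PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap of the x largest below bound
            for (int v : data) { // one pass over the data
                if (v >= bound) {
                    atOrAbove++;
                } else if (heap.size() < x) {
                    heap.add(v);
                } else if (v > heap.peek()) {
                    heap.poll();
                    heap.add(v);
                }
            }
            if (atOrAbove >= k) {
                return (int) bound; // duplicates of the previous boundary cover rank k
            }
            if (atOrAbove + heap.size() >= k) {
                long wanted = k - atOrAbove; // rank inside the heap, counted from the top
                while (heap.size() > wanted) heap.poll();
                return heap.peek();
            }
            bound = heap.peek(); // smallest value kept; the next pass looks below it
        }
    }

    /** Median: the middle value, or the average of the two middle values for an even count. */
    static double median(int[] data, int x) {
        long n = data.length;
        int upper = kthLargest(data, n / 2 + 1, x);
        if (n % 2 == 1) return upper;
        int lower = kthLargest(data, n / 2, x);
        return ((double) lower + upper) / 2.0;
    }
}
```

    The number of passes is about S / (2x), so this only pays off when x is a sizeable fraction of S.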
