Interview Question: Find Median From Mega Number Of Integers

暖寄归人 2020-12-12 14:02

There is a file that contains 10G (1000000000) integers. Please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with an re

9 Answers
  • 2020-12-12 14:42

    Make a pass through the file and find count of integers and minimum and maximum integer value.

    Take midpoint of min and max, and get count, min and max for values either side of the midpoint - by again reading through the file.

    Whichever partition contains the middle rank (half the total count, counting up from the smallest value) holds the median.

    Repeat for the partition, taking into account size of 'partitions to the left' (easy to maintain), and also watching for min = max.

    Am sure this'd work for an arbitrary number of partitions as well.
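
    A minimal Scala sketch of the repeated-halving idea above, under some assumptions: `scan` is a hypothetical stand-in for one sequential read of the file (each call must yield the same integers again), the values are 32-bit, and for an even count the lower of the two middle values is returned. The names `BisectionMedian` and `lowerMedian` are made up for illustration. Only a few counters are kept in memory, at the cost of up to 33 passes over the data.

```scala
object BisectionMedian {
  // `scan` is a hypothetical stand-in for one sequential read of the file.
  def lowerMedian(scan: () => Iterator[Int]): Int = {
    // Pass 1: count, minimum, and maximum.
    var count = 0L
    var min = Int.MaxValue
    var max = Int.MinValue
    scan().foreach { x =>
      count += 1
      if (x < min) min = x
      if (x > max) max = x
    }
    require(count > 0, "input must not be empty")

    val target = (count + 1) / 2 // 1-based rank of the lower median
    var lo = min.toLong          // Long arithmetic avoids overflow in hi - lo
    var hi = max.toLong
    var below = 0L               // number of values known to be < lo

    while (lo < hi) {            // also covers the min == max case
      val mid = lo + (hi - lo) / 2
      // One pass: count the values in the left partition [lo, mid].
      var left = 0L
      scan().foreach { x => if (x >= lo && x <= mid) left += 1 }
      if (below + left >= target) hi = mid   // median is in the left partition
      else { below += left; lo = mid + 1 }   // median is in the right partition
    }
    lo.toInt
  }
}
```

    For example, `BisectionMedian.lowerMedian(() => List(5, -3, 9, -3).iterator)` walks the range from -3 to 9 and settles on -3, the second smallest of the four values.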

  • 2020-12-12 14:42

    I was asked the same question and couldn't give an exact answer, so after the interview I went through some interview books. Here is what I found in Cracking the Coding Interview.

    Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?

    Our data structure brainstorm might look like the following:

    • Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.

    • Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.

    • Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.

    • Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off one heap and pushing it onto the other.

    Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
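
    The two-heap idea from the excerpt can be sketched in Scala roughly as follows. The names `RunningMedianSketch` and `RunningMedian` are made up for illustration. Note that both heaps together hold every element seen so far, so this fits the book's expanding-array example rather than a file that is larger than memory.

```scala
import scala.collection.mutable

object RunningMedianSketch {
  final class RunningMedian {
    // Max-heap holding the smaller half (PriorityQueue is a max-heap by default).
    private val lower = mutable.PriorityQueue.empty[Int]
    // Min-heap holding the bigger half, obtained by reversing the ordering.
    private val upper = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)

    def add(x: Int): Unit = {
      if (lower.isEmpty || x <= lower.head) lower.enqueue(x) else upper.enqueue(x)
      // Rebalance so that `lower` has the same size as `upper` or one more.
      if (lower.size > upper.size + 1) upper.enqueue(lower.dequeue())
      else if (upper.size > lower.size) lower.enqueue(upper.dequeue())
    }

    def median: Double = {
      require(lower.nonEmpty, "no elements added yet")
      // Odd count: the root of `lower`. Even count: average of both roots.
      if (lower.size > upper.size) lower.head.toDouble
      else (lower.head.toDouble + upper.head.toDouble) / 2.0
    }
  }
}
```

    After adding 5, 15, 1, and 3, the roots are 3 and 5, so `median` is 4.0; adding 8 next makes the count odd and the median becomes 5.0.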

  • 2020-12-12 14:45

    Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.

    Now accumulate the counts in that histogram, bin by bin, until the running total reaches half the total count; that bin contains the median.

    Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.

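    Accumulate the counts in that histogram, starting from the number of values that fell in the earlier top-bit bins, until the running total reaches half the count of the entire list; that bin gives the bottom bits of the median.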

    Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).

    Here's some sample Scala code that does this:

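    // For an even count this returns the lower of the two middle values.
    def medianFinder(numbers: Iterable[Int]): Int = {
      // Returns (cumulative count, index) of the first bin whose running total reaches `mid`.
      def midArgMid(a: Array[Long], mid: Long) = {
        val cuml = a.scanLeft(0L)(_ + _).drop(1)
        cuml.zipWithIndex.dropWhile(_._1 < mid).head
      }
      // Flipping the sign bit makes unsigned bin order match signed Int order,
      // so negative inputs land in the low bins instead of after the positives.
      def key(number: Int) = number ^ Int.MinValue
      // Pass 1: histogram of the top 16 bits.
      val topHistogram = new Array[Long](65536)
      var count = 0L
      numbers.foreach(number => {
        count += 1
        topHistogram(key(number) >>> 16) += 1
      })
      val (topCount, topIndex) = midArgMid(topHistogram, (count + 1) / 2)
      // Pass 2: histogram of the bottom 16 bits, restricted to the chosen top bin.
      val botHistogram = new Array[Long](65536)
      numbers.foreach(number => {
        if ((key(number) >>> 16) == topIndex) botHistogram(number & 0xFFFF) += 1
      })
      // Rank inside the chosen bin = overall rank minus everything in earlier bins.
      val (botCount, botIndex) =
        midArgMid(botHistogram, (count + 1) / 2 - (topCount - topHistogram(topIndex)))
      // Reassemble the two halves and undo the sign-bit flip.
      ((topIndex << 16) + botIndex) ^ Int.MinValue
    }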
    

    and here it is working on a small set of input data:

    scala> medianFinder(List(1,123,12345,1234567,123456789))
    res18: Int = 12345
    

    If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
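
    One way the four-pass variant for 64-bit values might look, as a sketch rather than a definitive implementation. Assumptions: `scan` is a hypothetical stand-in for one sequential read of the file, the values are signed `Long`s (the sign bit is flipped so that bin order matches numeric order), and for an even count the lower of the two middle values is returned. The names `Median64Sketch` and `lowerMedian` are made up for illustration.

```scala
object Median64Sketch {
  // `scan` is a hypothetical stand-in for one sequential read of the file.
  def lowerMedian(scan: () => Iterator[Long]): Long = {
    var prefix = 0L // bits of the answer fixed so far (in sign-flipped key space)
    var rank = 0L   // 1-based rank of the median among values matching the prefix
    for (pass <- 0 until 4) {
      val shift = 48 - 16 * pass               // position of this pass's 16 bits
      val histogram = new Array[Long](1 << 16) // 512 KB of counters per pass
      var seen = 0L
      scan().foreach { x =>
        val key = x ^ Long.MinValue // flip sign bit: unsigned key order == signed order
        // Pass 0 accepts everything (a 64-bit shift would be a no-op on the JVM);
        // later passes accept only values sharing the high bits chosen so far.
        if (pass == 0 || (key >>> (shift + 16)) == (prefix >>> (shift + 16))) {
          histogram(((key >>> shift) & 0xFFFF).toInt) += 1
          seen += 1
        }
      }
      if (pass == 0) {
        require(seen > 0, "input must not be empty")
        rank = (seen + 1) / 2
      }
      // Walk the bins until the running total reaches the remaining rank.
      var bin = 0
      while (histogram(bin) < rank) { rank -= histogram(bin); bin += 1 }
      prefix |= bin.toLong << shift
    }
    prefix ^ Long.MinValue // undo the sign-bit flip
  }
}
```

    Each pass fixes 16 more bits of the answer, so the memory use stays at one 65536-entry histogram regardless of the input size.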
