Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM

栀梦 2020-12-22 14:33

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the

30 answers
  • 2020-12-22 15:04

    Suppose this task is possible. Just prior to output, there will be an in-memory representation of the million sorted numbers. How many different such representations are there? Since there may be repeated numbers we can't use nCr (choose), but there is an operation called multichoose that works on multisets.

    • There are 2.2e2436455 ways to choose a million numbers in range 0..99,999,999.
    • That requires 8,093,730 bits to represent every possible combination, or 1,011,717 bytes.

    So theoretically it may be possible, if you can come up with a sane (enough) representation of the sorted list of numbers. For example, an insane representation might require a 10MB lookup table or thousands of lines of code.

    However, if "1M RAM" means one million bytes, then clearly there is not enough space, since 1,011,717 bytes are needed. If it means 2^20 = 1,048,576 bytes, the task becomes theoretically possible. The fact that it takes that extra 5% of memory to get there suggests to me that the representation will have to be VERY efficient and probably not sane.
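
    The figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `log2_multichoose` is made up for illustration, and it uses `math.lgamma` to avoid building the 2.4-million-digit integer explicitly, so the result is a floating-point approximation.

```python
import math

def log2_multichoose(n, k):
    """log2 of the number of size-k multisets drawn from n distinct values.

    Multichoose(n, k) equals C(n + k - 1, k); lgamma(x + 1) is ln(x!).
    """
    top = n + k - 1
    ln_count = (math.lgamma(top + 1)
                - math.lgamma(k + 1)
                - math.lgamma(top - k + 1))
    return ln_count / math.log(2)

# One million numbers, each in the range 0..99,999,999.
bits = log2_multichoose(10**8, 10**6)
required_bytes = math.ceil(math.ceil(bits) / 8)

print(f"bits needed:  {bits:,.0f}")
print(f"bytes needed: {required_bytes:,}")
print(f"fits in 10^6 bytes: {required_bytes <= 10**6}")
print(f"fits in 2^20 bytes: {required_bytes <= 2**20}")
```

    The result sits between one million bytes and 2^20 bytes, which is why the interpretation of "1M" matters.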

  • 2020-12-22 15:04

    There is one solution to this problem across all possible inputs. Cheat.

    1. Read m values over TCP, where m is near the max that can be sorted in memory, maybe n/4.
    2. Sort the 250,000 (or so) numbers and output them.
    3. Repeat for the other 3 quarters.
    4. Let the receiver merge the 4 lists of numbers it has received as it processes them. (It's not much slower than using a single list.)
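
    The steps above can be sketched as follows. Both function names are hypothetical, and the network is left out entirely: `sort_in_chunks` plays the sender, which only ever holds one chunk in memory, and `receiver_merge` plays the receiver, which is assumed to have enough memory to hold all the chunks it has been sent.

```python
import heapq

def sort_in_chunks(values, chunk_size):
    """Sender side: yield sorted chunks, each small enough to sort in memory."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            chunk.sort()
            yield chunk
            chunk = []          # start a fresh list for the next chunk
    if chunk:                   # final partial chunk, if any
        chunk.sort()
        yield chunk

def receiver_merge(sorted_chunks):
    """Receiver side: k-way merge of the already sorted chunks."""
    return list(heapq.merge(*sorted_chunks))

incoming = [42, 7, 99999999, 7, 31415926, 0, 12345678, 500]
print(receiver_merge(sort_in_chunks(incoming, chunk_size=2)))
```

    With n = 1,000,000 and four chunks, `chunk_size` would be 250,000.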
  • 2020-12-22 15:06

    We could play with the networking stack to send the numbers in sorted order before we have all the numbers. If you send 1M of data, TCP/IP will break it into 1500 byte packets and stream them in order to the target. Each packet will be given a sequence number.

    We can do this by hand. Just before we fill our RAM, we can sort what we have and send the list to our target, but leave holes in our sequence around each number. Then process the second half of the numbers the same way, using those holes in the sequence.

    The networking stack on the far end will assemble the resulting data stream in order of sequence before handing it up to the application.

    It's using the network to perform a merge sort. This is a total hack, but I was inspired by the other networking hack listed previously.

  • 2020-12-22 15:07

    I would exploit the retransmission behaviour of TCP.

    1. Make the TCP component create a large receive window.
    2. Receive some number of packets without sending an ACK for them.
      • Process those in passes, creating some (prefix) compressed data structure.
      • Send a duplicate ACK for the last packet that is no longer needed, or wait for the retransmission timeout.
      • Go to 2.
    3. Stop once all packets have been accepted.

    This assumes that buckets or multiple passes give some kind of benefit, probably by sorting the batches/buckets and merging them, which points toward radix trees.

    Use this technique to accept and sort the first 80%, then read the last 20% and verify that it does not contain numbers that would land among the lowest 20% of the numbers. Then send the 20% lowest numbers, remove them from memory, accept the remaining 20% of new numbers, and merge.
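
    The multi-pass idea can be illustrated with a small simulation. This is only a sketch under a strong assumption: `replay_stream` is a hypothetical callable that hands back the whole input again on every call, standing in for a sender that keeps retransmitting unacknowledged data. No real TCP stack is involved, and a real sender's buffer limits are ignored.

```python
def multipass_sort(replay_stream, value_range, passes):
    """Yield every value in sorted order, keeping one bucket in memory at a time.

    replay_stream: zero-argument callable returning a fresh iterator over the
                   full input (assumed stand-in for TCP retransmission).
    value_range:   values are assumed to lie in 0 .. value_range - 1.
    passes:        number of buckets, i.e. how many times the input is replayed.
    """
    bucket_width = -(-value_range // passes)   # ceiling division
    for p in range(passes):
        low = p * bucket_width
        high = low + bucket_width
        # Keep only the values that belong to this pass; drop the rest and
        # rely on seeing them again in a later replay.
        bucket = [v for v in replay_stream() if low <= v < high]
        bucket.sort()
        yield from bucket

data = [73, 5, 99, 5, 18, 64, 0, 42]
print(list(multipass_sort(lambda: iter(data), value_range=100, passes=4)))
```

    Peak memory is roughly one bucket, at the price of receiving the input once per pass.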

  • 2020-12-22 15:08

    I have a computer with 1M of RAM and no other local storage

    Another way to cheat: you could use non-local (networked) storage instead (your question does not preclude this) and call a networked service that could use straightforward disk-based mergesort (or just enough RAM to sort in-memory, since you only need to accept 1M numbers), without needing the (admittedly extremely ingenious) solutions already given.

    This might be cheating, but it's not clear whether you are looking for a solution to a real-world problem, or a puzzle that invites bending of the rules... if the latter, then a simple cheat may get better results than a complex but "genuine" solution (which as others have pointed out, can only work for compressible inputs).

  • 2020-12-22 15:09

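    There are 10^6 values in a range of 10^8, so there's one value per hundred code points on average. Store the distance from the Nth point to the (N+1)th. Duplicate values have a skip of 0. An average skip of 100 is just under 7 bits as a plain binary number (log2 100 ≈ 6.64), but the code also has to mark where each skip ends, which brings the cost to about 8.1 bits per skip. That matches the 8,093,730 bits given by the counting argument in the first answer. A million skips therefore fit into the 8,388,608 bits of a 2^20-byte machine, but with only about 3.5% to spare.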

    These skips need to be encoded into a bitstream, say by Huffman encoding. Insertion is by iterating through the bitstream and rewriting after the new value. Output by iterating through and writing out the implied values. For practicality, it probably wants to be done as, say, 10^4 lists covering 10^4 code points (and an average of 100 values) each.

    A good Huffman tree for random data can be built a priori by assuming that the skips follow a geometric distribution with mean 100 (the number of values falling in a fixed window is Poisson, but the skips between them are geometric), but real statistics can be kept on the input and used to generate an optimal tree to deal with pathological cases.
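
    The skip encoding described above can be sketched as follows. Note the substitution: instead of a Huffman tree, this sketch uses a Rice code (unary quotient plus a fixed-width remainder), which is simpler to write down and suits skips of this kind. The parameter `RICE_K = 6` and the function names are assumptions made for illustration. The bitstream is held as a Python list of 0/1 integers purely for readability, so this demonstrates the bit count, not a memory-tight implementation.

```python
import random

RICE_K = 6  # assumed parameter: 2**6 = 64 is close to the mean skip of 100

def encode(sorted_values, k=RICE_K):
    """Encode the skips between consecutive sorted values as a list of bits."""
    bits = []
    previous = 0
    for value in sorted_values:
        skip = value - previous            # 0 for a duplicate value
        previous = value
        bits.extend([1] * (skip >> k))     # quotient in unary
        bits.append(0)                     # unary terminator
        for shift in range(k - 1, -1, -1):
            bits.append((skip >> shift) & 1)   # k-bit remainder, MSB first
    return bits

def decode(bits, k=RICE_K):
    """Rebuild the sorted values by accumulating the decoded skips."""
    values = []
    previous = 0
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == 1:
            quotient += 1
            i += 1
        i += 1                             # step over the terminating 0
        remainder = 0
        for _ in range(k):
            remainder = (remainder << 1) | bits[i]
            i += 1
        previous += (quotient << k) | remainder
        values.append(previous)
    return values

# Scaled-down demo with the same density: 10^4 values in a range of 10^6.
rng = random.Random(0)
sample = sorted(rng.randrange(10**6) for _ in range(10**4))
encoded = encode(sample)
assert decode(encoded) == sample
bits_per_value = len(encoded) / len(sample)
print(f"{bits_per_value:.2f} bits per value")
```

    On uniformly random input this comes out a little above 8 bits per value. Insertion into such a stream still means decoding and rewriting everything after the insertion point, which is why splitting the range into many short lists is suggested above.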
