Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM

栀梦 2020-12-22 14:33

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list back out over another TCP connection.

30 answers
  • 2020-12-22 14:46

    Now aiming at an actual solution, covering all possible inputs in the 8-digit range with only 1 MB of RAM. NOTE: work in progress; I'll continue tomorrow. Using arithmetic coding of the deltas of the sorted ints, the worst case for 1M sorted ints would cost about 7 bits per entry (the average gap is 99999999/1000000 ≈ 100, and log2(100) is a little under 7 bits).

    But you need all 1M integers sorted before you get down to 7 or 8 bits! A shorter series has bigger deltas, and therefore needs more bits per element.

    I'm working on taking in as many numbers as possible and compressing them (almost) in place. A first batch of close to 250K ints would need about 9 bits each at best, so the result would take about 275 KB. Repeat with the remaining free memory a few times. Then decompress-merge-in-place-compress those compressed chunks. This is quite hard, but possible. I think.

    The merged lists would get closer and closer to the 7-bit-per-integer target, but I don't know how many iterations of the merge loop it would take. Perhaps 3.

    But the imprecision of the arithmetic coding implementation might make it impossible. If this problem is solvable at all, it would be extremely tight.

    Any volunteers?
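
    A quick back-of-the-envelope check of those per-entry figures (my own sketch, not part of the original answer; only the constants from the question are assumed):

    import math

    MAX = 100_000_000            # values lie in 0..99,999,999

    def naive_bits_per_delta(n_sorted):
        """log2 of the average gap between n_sorted values spread over the range."""
        return math.log2(MAX / n_sorted)

    print(round(naive_bits_per_delta(1_000_000), 2))   # ~6.64 bits: the "almost 7" figure
    print(round(naive_bits_per_delta(250_000), 2))     # ~8.64 bits: the "about 9" for a 250K batch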

  • 2020-12-22 14:47

    Here's some working C++ code which solves the problem.

    Proof that the memory constraints are satisfied:

    Editor: There is no proof of the maximum memory requirements offered by the author either in this post or in his blogs. Since the number of bits necessary to encode a value depends on the values previously encoded, such a proof is likely non-trivial. The author notes that the largest encoded size he could stumble upon empirically was 1011732, and chose the buffer size 1013000 arbitrarily.

    typedef unsigned int u32;
    
    namespace WorkArea
    {
        static const u32 circularSize = 253250;
        u32 circular[circularSize] = { 0 };         // consumes 1013000 bytes
    
        static const u32 stageSize = 8000;
        u32 stage[stageSize];                       // consumes 32000 bytes
    
        ...
    

    Together, these two arrays take 1045000 bytes of storage. That leaves 1048576 - 1045000 - 2×1024 = 1528 bytes for remaining variables and stack space.

    It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of sort1mb.exe.

    from subprocess import Popen, PIPE
    import random

    # One million random 8-digit values, duplicates allowed.
    sequence = [random.randint(0, 99999999) for i in range(1000000)]

    # Feed them to the sorter as zero-padded decimal lines (text mode for str I/O).
    sorter = Popen('sort1mb.exe', stdin=PIPE, stdout=PIPE, text=True)
    for value in sequence:
        sorter.stdin.write('%08d\n' % value)
    sorter.stdin.close()

    # The program's output must match Python's own sorted copy of the input.
    result = [int(line) for line in sorter.stdout]
    print('OK!' if result == sorted(sequence) else 'Error!')
    

    A detailed explanation of the algorithm can be found in the following series of posts:

    • 1MB Sorting Explained
    • Arithmetic Coding and the 1MB Sorting Problem
    • Arithmetic Encoding Using Fixed-Point Math
  • 2020-12-22 14:50

    My suggestions here owe a lot to Dan's solution.

    First off, I assume the solution must handle all possible input lists. I think the popular answers do not make this assumption (which, IMO, is a huge mistake).

    It is known that no form of lossless compression will reduce the size of all inputs.

    All the popular answers assume they will be able to apply compression effective enough to leave them extra space: a chunk of extra space large enough to hold some portion of their partially completed list in uncompressed form, in which they can perform their sorting operations. This is just a bad assumption.

    For such a solution, anyone who knows how the compression is done can design input data that does not compress well under that scheme, and the "solution" will then most likely break by running out of space.

    Instead, I take a mathematical approach. Our possible outputs are all the lists of length LEN whose elements are drawn from MAX possible values. Here LEN is 1,000,000 and MAX is 100,000,000.

    For arbitrary LEN and MAX, the number of bits needed to encode this state is:

    log2(MAX multichoose LEN) = log2(C(MAX + LEN - 1, LEN))

    So for our numbers, once we have completed receiving and sorting, we will need at least log2(100,000,000 MC 1,000,000) bits to store our result in a way that can uniquely distinguish all possible outputs.

    This is ≈ 988 KiB (about 8,093,730 bits). So we actually have enough space to hold our result. From this point of view, it is possible.
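
    To put a number on that bound, here is a quick computation (my own sketch, using the log-gamma form of the binomial coefficient; not part of the original answer):

    import math

    LEN, MAX = 1_000_000, 100_000_000

    # log2 of (MAX multichoose LEN) = log2 C(MAX + LEN - 1, LEN), via log-gamma.
    log2_states = (math.lgamma(MAX + LEN) - math.lgamma(LEN + 1)
                   - math.lgamma(MAX)) / math.log(2)

    print(f"{log2_states:,.0f} bits")            # about 8,093,730 bits
    print(f"{log2_states / 8 / 1024:,.0f} KiB")  # about 988 KiB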

    [Deleted pointless rambling now that better examples exist...]

    Best answer is here.

    Another good answer is here; it basically uses insertion sort as the function that expands the list by one element (it buffers a few elements and pre-sorts them so that more than one can be inserted at a time, which saves a bit of time). It uses a nice compact state encoding too: buckets of seven-bit deltas.

  • 2020-12-22 14:50

    Sorting is a secondary problem here. As others have said, just storing the integers is hard, and cannot work on all inputs, since about 27 bits per number (log2(10^8) ≈ 26.6) would be needed to store each one individually.

    My take on this is: store only the differences between consecutive (sorted) integers, as they will most likely be small. Then use a compression scheme, e.g. 2 additional bits per input number, to encode how many bits the difference is stored in. Something like:

    00 -> 5 bits
    01 -> 11 bits
    10 -> 19 bits
    11 -> 27 bits
    

    It should be possible to store a fair number of possible input lists within the given memory constraint. The math of how to pick the compression scheme so that it works on the maximum number of inputs is beyond me.
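
    As an illustration of that scheme, here is a rough encode-step sketch (mine, not the answer's code), using the four widths from the table above and treating the stream as deltas from an implicit leading 0:

    WIDTHS = [5, 11, 19, 27]        # indexed by the 2-bit selector

    def encode_deltas(sorted_values):
        """Encode gaps between consecutive sorted values as '0'/'1' strings (for clarity)."""
        bits, prev = [], 0
        for v in sorted_values:
            delta, prev = v - prev, v
            for tag, width in enumerate(WIDTHS):
                if delta < (1 << width):
                    bits.append(format(tag, '02b') + format(delta, '0%db' % width))
                    break
        return ''.join(bits)

    # Gaps of 3, 97 and 70,000 cost 2+5, 2+11 and 2+19 bits respectively.
    print(len(encode_deltas([3, 100, 70100])))      # 41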

    I hope you may be able to exploit domain-specific knowledge of your input to find a good enough integer compression scheme based on this.

    Oh and then, you do an insertion sort on that sorted list as you receive data.

  • 2020-12-22 14:51

    A solution is possible only because of the difference between 1 megabyte and 1 million bytes. There are about 2 to the power 8093729.5 different ways to choose 1 million 8-digit numbers with duplicates allowed and order unimportant, so a machine with only 1 million bytes of RAM doesn't have enough states to represent all the possibilities. But 1M (less 2k for TCP/IP) is 1022*1024*8 = 8372224 bits, so a solution is possible.

    Part 1, initial solution

    This approach needs a little more than 1M, I'll refine it to fit into 1M later.

    I'll store a compact sorted list of numbers in the range 0 to 99999999 as a sequence of sublists of 7-bit numbers. The first sublist holds numbers from 0 to 127, the second sublist holds numbers from 128 to 255, etc. 100000000/128 is exactly 781250, so 781250 such sublists will be needed.

    Each sublist consists of a 2-bit sublist header followed by a sublist body. The sublist body takes up 7 bits per sublist entry. The sublists are all concatenated together, and the format makes it possible to tell where one sublist ends and the next begins. The total storage required for a fully populated list is 2*781250 + 7*1000000 = 8562500 bits, which is about 1.021 M-bytes.

    The 4 possible sublist header values are:

    00 Empty sublist, nothing follows.

    01 Singleton, there is only one entry in the sublist and the next 7 bits hold it.

    10 The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).

    11 The sublist holds 2 or more repetitions of a single number. The next 7 bits give the number. Then come zero or more 7-bit entries with the value 1, followed by a 7-bit entry with the value 0. The length of the sublist body dictates the number of repetitions. For example, the numbers 12,12 would be stored as (12,0), the numbers 12,12,12 would be stored as (12,1,0), 12,12,12,12 would be (12,1,1,0) and so on.
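
    To make the format concrete, here is a rough sketch (mine, not from the answer) of how one sorted sublist could be rendered in this scheme, with each entry shown as a 7-bit string:

    def encode_sublist(values):
        """values: the sorted 7-bit entries (0..127) belonging to one sublist."""
        e = lambda x: format(x, '07b')                      # one 7-bit entry
        if not values:
            return '00'                                     # empty
        if len(values) == 1:
            return '01' + e(values[0])                      # singleton
        if values[0] == values[-1]:                         # 2+ repetitions of one number
            return '11' + e(values[0]) + e(1) * (len(values) - 2) + e(0)
        # 2+ distinct numbers: rotate so the minimum comes last; the final entry
        # is strictly smaller than its predecessor, which marks the sublist's end.
        return '10' + ''.join(e(v) for v in values[1:] + values[:1])

    print(encode_sublist([2, 4, 6]))        # header 10, then entries 4, 6, 2
    print(encode_sublist([12, 12, 12]))     # header 11, then entries 12, 1, 0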

    I start off with an empty list, read a bunch of numbers in and store them as 32 bit integers, sort the new numbers in place (using heapsort, probably) and then merge them into a new compact sorted list. Repeat until there are no more numbers to read, then walk the compact list once more to generate the output.

    The line below represents memory just before the start of the list merge operation. The "O"s are the region that hold the sorted 32-bit integers. The "X"s are the region that hold the old compact list. The "=" signs are the expansion room for the compact list, 7 bits for each integer in the "O"s. The "Z"s are other random overhead.

    ZZZOOOOOOOOOOOOOOOOOOOOOOOOOO==========XXXXXXXXXXXXXXXXXXXXXXXXXX
    

    The merge routine starts reading at the leftmost "O" and at the leftmost "X", and starts writing at the leftmost "=". The write pointer never catches up with the compact-list read pointer before all of the new integers are merged, because both pointers advance 2 bits for each sublist and 7 bits for each entry in the old compact list, and there is enough extra room for the 7-bit entries of the new numbers.

    Part 2, cramming it into 1M

    To squeeze the solution above into 1M, I need to make the compact list format a bit more compact. I'll get rid of one of the four header codes, so that there will be just 3 different possible sublist header values. Then I can use "00", "01" and "1" as the sublist header values and save a few bits. The sublist types are:

    A Empty sublist, nothing follows.

    B Singleton, there is only one entry in the sublist and the next 7 bits hold it.

    C The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).

    D The sublist consists of 2 or more repetitions of a single number.

    My 3 sublist header values will be "A", "B" and "C", so I need a way to represent D-type sublists.

    Suppose I have the C-type sublist header followed by 3 entries, such as "C[17][101][58]". This can't be part of a valid C-type sublist as described above, since the third entry is less than the second but more than the first. I can use this type of construct to represent a D-type sublist. In bit terms, anywhere I have "C{00?????}{1??????}{01?????}" is an impossible C-type sublist. I'll use this to represent a sublist consisting of 3 or more repetitions of a single number. The first two 7-bit words encode the number (the "N" bits below) and are followed by zero or more {0100001} words followed by a {0100000} word.

    For example, 3 repetitions: "C{00NNNNN}{1NN0000}{0100000}", 4 repetitions: "C{00NNNNN}{1NN0000}{0100001}{0100000}", and so on.
    

    That just leaves lists that hold exactly 2 repetitions of a single number. I'll represent those with another impossible C-type sublist pattern: "C{0??????}{11?????}{10?????}". There's plenty of room for the 7 bits of the number in the first 2 words, but this pattern is longer than the sublist that it represents, which makes things a bit more complex. The five question-marks at the end can be considered not part of the pattern, so I have: "C{0NNNNNN}{11N????}10" as my pattern, with the number to be repeated stored in the "N"s. That's 2 bits too long.

    I'll have to borrow 2 bits and pay them back from the 4 unused bits in this pattern. When reading, on encountering "C{0NNNNNN}{11N00AB}10", output 2 instances of the number in the "N"s, overwrite the "10" at the end with bits A and B, and rewind the read pointer by 2 bits. Destructive reads are ok for this algorithm, since each compact list gets walked only once.

    When writing a sublist of 2 repetitions of a single number, write "C{0NNNNNN}11N00" and set the borrowed bits counter to 2. At every write where the borrowed bits counter is non-zero, it is decremented for each bit written and "10" is written when the counter hits zero. So the next 2 bits written will go into slots A and B, and then the "10" will get dropped onto the end.

    With 3 sublist header values represented by "00", "01" and "1", I can assign "1" to the most popular sublist type. I'll need a small table to map sublist header values to sublist types, and I'll need an occurrence counter for each sublist type so that I know what the best sublist header mapping is.

    The worst-case minimal representation of a fully populated compact list occurs when all the sublist types are equally popular. In that case I save 1 bit for every 3 sublist headers, so the list size is 2*781250 + 7*1000000 - 781250/3 = 8302083.3 bits. Rounding up to a 32-bit word boundary, that's 8302112 bits, or 1037764 bytes.
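
    A quick check of that worst-case arithmetic (my own sketch, not part of the original answer):

    sublists, entries = 781_250, 1_000_000

    # 2-bit headers everywhere, 7 bits per entry, minus the bit saved on every
    # third sublist header when the three header codes are equally popular.
    bits = 2 * sublists + 7 * entries - sublists // 3
    bits = (bits + 31) // 32 * 32          # round up to a 32-bit word boundary
    print(bits, bits // 8)                 # 8302112 bits, 1037764 bytes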

    1M minus the 2k for TCP/IP state and buffers is 1022*1024 = 1046528 bytes, leaving me 8764 bytes to play with.

    But what about the process of changing the sublist header mapping? In the memory map below, "Z" is random overhead, "=" is free space, "X" is the compact list.

    ZZZ=====XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    

    Start reading at the leftmost "X" and start writing at the leftmost "=" and work right. When it's done the compact list will be a little shorter and it will be at the wrong end of memory:

    ZZZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=======
    

    So then I'll need to shunt it to the right:

    ZZZ=======XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    

    In the header mapping change process, up to 1/3 of the sublist headers will be changing from 1-bit to 2-bit. In the worst case these will all be at the head of the list, so I'll need at least 781250/3 bits of free storage before I start, which takes me back to the memory requirements of the previous version of the compact list :(

    To get around that, I'll split the 781250 sublists into 10 sublist groups of 78125 sublists each. Each group has its own independent sublist header mapping. Using the letters A to J for the groups:

    ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    

    Each sublist group shrinks or stays the same during a sublist header mapping change:

    ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAA=====BBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABB=====CCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCC======DDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDD======EEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEE======FFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEEFFF======GGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGG=======HHIJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHH=======IJJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHI=======JJJJJJJJJJJJJJJJJJJJ
    ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ=======
    ZZZ=======AAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
    

    The worst case temporary expansion of a sublist group during a mapping change is 78125/3 = 26042 bits, under 4k. If I allow 4k plus the 1037764 bytes for a fully populated compact list, that leaves me 8764 - 4096 = 4668 bytes for the "Z"s in the memory map.

    That should be plenty for the 10 sublist header mapping tables, 30 sublist header occurrence counts and the other few counters, pointers and small buffers I'll need, and space I've used without noticing, like stack space for function call return addresses and local variables.

    Part 3, how long would it take to run?

    With an empty compact list the 1-bit list header will be used for an empty sublist, and the starting size of the list will be 781250 bits. In the worst case the list grows 8 bits for each number added, so 32 + 8 = 40 bits of free space are needed for each of the 32-bit numbers to be placed at the top of the list buffer and then sorted and merged. In the worst case, changing the sublist header mapping results in a space usage of 2*781250 + 7*entries - 781250/3 bits.

    With a policy of changing the sublist header mapping after every fifth merge once there are at least 800000 numbers in the list, a worst case run would involve a total of about 30M of compact list reading and writing activity.

    Source:

    http://nick.cleaton.net/ramsortsol.html

  • 2020-12-22 14:51

    (My original answer was wrong, sorry for the bad math, see below the break.)

    How about this?

    The first 27 bits store the lowest number you have seen, then each difference to the next number seen is encoded as follows: 5 bits to store the number of bits used for the difference, then the difference itself. Use 00000 to indicate that you saw that number again.

    This works because, as more numbers are inserted, the average difference between consecutive numbers goes down, so you use fewer bits to store each difference as you add more numbers. I believe this is called a delta list.

    The worst case I can think of is all numbers evenly spaced (by 100), e.g. assuming 0 is the first number:

    000000000000000000000000000 00111 1100100
                                ^^^^^^^^^^^^^
                                a million times
    
    27 + 1,000,000 * (5+7) bits = 12,000,027 bits = ~1.43 MiB
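
    Here is a rough sketch of that encoding (mine, not the answer's code) which reproduces the worst-case figure above:

    def encode(sorted_values):
        out = [format(sorted_values[0], '027b')]        # first value in 27 bits
        prev = sorted_values[0]
        for v in sorted_values[1:]:
            delta, prev = v - prev, v
            if delta == 0:
                out.append('00000')                     # duplicate marker
            else:
                width = delta.bit_length()              # at most 27 for this range
                out.append(format(width, '05b') + format(delta, '0%db' % width))
        return ''.join(out)

    # Evenly spaced by 100: every delta costs 5 + 7 = 12 bits.
    bits = len(encode(list(range(0, 100_000_000, 100))))
    print(bits, round(bits / 8 / 1024))                 # ~12 million bits, ~1465 KiB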
    

    Reddit to the rescue!

    If all you had to do was sort them, this problem would be easy. It takes 122k (1 million bits) to store which numbers you have seen (0th bit on if 0 was seen, 2300th bit on if 2300 was seen, etc.).

    You read the numbers, store them in the bit field, and then shift the bits out while keeping a count.
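
    For reference, here is a tiny sketch (mine, not the answer's code) of that bitmap idea. Note that indexing by value means one bit per possible value, so the full 8-digit range would need 10^8 bits; the 1-million-bit figure corresponds to a 0..999,999 value range:

    def bitmap_sort(values, max_value):
        """Presence-bitmap 'sort': one bit per possible value, then scan in order."""
        seen = bytearray((max_value + 8) // 8)
        for v in values:
            seen[v >> 3] |= 1 << (v & 7)
        return [v for v in range(max_value + 1) if seen[v >> 3] & (1 << (v & 7))]

    # Duplicates collapse, which is exactly the problem the rest of this
    # answer tries to address by keeping per-value counts.
    print(bitmap_sort([42, 7, 99, 7], 99))      # [7, 42, 99]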

    BUT, you have to remember how many you have seen. I was inspired by the sublist answer above to come up with this scheme:

    Instead of using one bit, use either 2 or 27 bits:

    • 00 means you did not see the number.
    • 01 means you saw it once
    • 1 means you saw it, and the next 26 bits are the count of how many times.

    I think this works: if there are no duplicates, you have a 244k list. In the worst case you see each number twice (if you see one number three times, it shortens the rest of the list for you), that means you have seen 50,000 more than once, and you have seen 950,000 items 0 or 1 times.

    50,000 * 27 + 950,000 * 2 = 396.7k.

    You can make further improvements if you use the following encoding:

    • 0 means you did not see the number
    • 10 means you saw it once
    • 11 is how you keep count

    Which will, on average, result in 280.7k of storage.

    EDIT: my Sunday morning math was wrong.

    The worst case is we see 500,000 numbers twice, so the math becomes:

    500,000 * 27 + 500,000 * 2 = 14,500,000 bits ≈ 1.73 MiB

    The alternate encoding results in an average storage of

    500,000 * 27 + 500,000 * 1 = 14,000,000 bits ≈ 1.67 MiB

    : (
