Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM

栀梦 2020-12-22 14:33

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the

30 answers
  • 2020-12-22 14:43

    I think one way to think about this is from a combinatorics viewpoint: how many possible combinations of sorted number orderings are there? If we give the combination 0,0,0,....,0 the code 0, and 0,0,0,...,1 the code 1, and 99999999, 99999999, ... 99999999 the code N, what is N? In other words, how big is the result space?

    Well, one way to think about this is to notice that our problem is in bijection with the problem of counting the monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths are there from the bottom left to the top right? A shortest path only ever moves right or up (moving down or left would undo earlier progress). To see how this maps onto our number sorting problem, observe the following:

    You can imagine any horizontal leg in our path as a number in our ordering, where the Y location of the leg represents the value.


    So if the path simply moves to the right all the way to the end, then jumps all the way to the top, that is equivalent to the ordering 0,0,0,...,0. If it instead begins by jumping all the way to the top and then moves to the right 1,000,000 times, that is equivalent to 99999999,99999999,..., 99999999. A path that moves right once, then up once, then right once, then up once, and so on to the very end (then necessarily jumps all the way to the top) is equivalent to 0,1,2,3,...,999999.

    Luckily for us this problem has already been solved. Such a grid has (N + M) Choose (M) paths:

    (1,000,000 + 100,000,000) Choose (100,000,000) ~= 2.27 * 10^2436455

    The largest code (the N of the first paragraph, not the grid width) is thus about 2.27 * 10^2436455, so the code 0 represents 0,0,0,...,0 and the code 2.27 * 10^2436455 and some change represents 99999999,99999999,..., 99999999.

    In order to store all the numbers from 0 to 2.27 * 10^2436455 you need log2 (2.27 * 10^2436455) = 8.0937 * 10^6 bits.

    1 megabyte = 8388608 bits > 8093700 bits
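    The size of the result space can be checked numerically. The sketch below is only an illustration (the class and method names are made up for this example): it sums logarithms so the huge binomial coefficient is never built, and uses the same (N + M) Choose (M) count as above.

```java
public class ResultSpaceBits {
    // log2 of C(n, k), accumulated as a sum of logarithms so that the
    // astronomically large binomial itself is never materialized.
    public static double log2Binomial(long n, long k) {
        double sumLn = 0.0;
        for (long i = 1; i <= k; i++) {
            sumLn += Math.log((double) (n - k + i) / (double) i);
        }
        return sumLn / Math.log(2.0);
    }

    public static void main(String[] args) {
        long count = 1_000_000L;        // how many numbers are sorted
        long range = 100_000_000L;      // how many distinct values exist
        double bits = log2Binomial(count + range, count);
        long budget = 8L * 1024 * 1024; // 1 MB expressed in bits
        System.out.println("bits needed: " + Math.ceil(bits));
        System.out.println("bits available: " + budget);
    }
}
```

    The loop runs one million times, so it finishes quickly, and the result lands close to the 8.0937 * 10^6 bits quoted above.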

    So it appears that we at least have enough room to store the result! Now of course the interesting bit is doing the sorting as the numbers stream in. I am not sure what the best approach is, given that we have only 294908 bits remaining. I imagine an interesting technique would be to assume at each point that what we have so far is the entire ordering, find the code for that ordering, and then go back and update the code as each new number arrives. Hand wave hand wave.

  • 2020-12-22 14:43

    You just need to store the differences between consecutive numbers in the sorted sequence, and use an encoding to compress these differences. We have 2^23 bits. We shall divide them into 6-bit chunks, and let the last bit of each chunk indicate whether the number extends into another 6-bit chunk (so each chunk carries 5 value bits plus 1 extension bit).

    Thus, 000010 is 1, and 000100 is 2. Reading the most significant chunk first, 001001000000 is 128 (value bits 00100 00000). Now consider the worst case for representing the differences in a sequence of numbers up to 10,000,000. There can be at most 10,000,000/2^5 differences greater than 2^5, 10,000,000/2^10 differences greater than 2^10, 10,000,000/2^15 differences greater than 2^15, and so on.

    So we add up how many bits it takes to represent the sequence: 1,000,000*6 + roundup(10,000,000/2^5)*6 + roundup(10,000,000/2^10)*6 + roundup(10,000,000/2^15)*6 + roundup(10,000,000/2^20)*4 = 7,935,472.

    2^23 = 8,388,608. Since 8,388,608 > 7,935,472, we have enough memory for that range. We will probably need a little extra memory to hold the running sum of differences (our current position) while we insert new numbers. We then go through the sequence, find where to insert the new number, decrease the next difference if necessary, and shift everything after it right. Note that 8-digit numbers actually run up to 99,999,999 rather than 10,000,000. With that range every one of the 1,000,000 differences can exceed 2^5 and take at least 12 bits, about 12,000,000 bits in total, so the estimate does not carry over as is.
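    The chunk format can be sketched in a few lines. This is a minimal illustration under one stated assumption that the answer does not spell out: the most significant chunk is written first. The class name `SixBitChunks` and its methods are invented for the example, and bits are kept in a String only to make them easy to read.

```java
public class SixBitChunks {
    // Encode a non-negative difference as 6-bit chunks: 5 value bits followed
    // by 1 continuation bit (1 = another chunk follows, 0 = last chunk).
    // Assumption: the most significant chunk comes first.
    public static String encode(long diff) {
        int groups = 1;                       // number of 5-bit groups needed
        while ((diff >>> (5 * groups)) != 0) {
            groups++;
        }
        StringBuilder sb = new StringBuilder();
        for (int g = groups - 1; g >= 0; g--) {
            int five = (int) ((diff >>> (5 * g)) & 0x1f);
            String bits = Integer.toBinaryString(five);
            for (int pad = bits.length(); pad < 5; pad++) {
                sb.append('0');               // left-pad to 5 value bits
            }
            sb.append(bits);
            sb.append(g > 0 ? '1' : '0');     // continuation bit
        }
        return sb.toString();
    }

    // Decode the first difference found at the start of a bit string.
    public static long decode(String bits) {
        long value = 0;
        for (int pos = 0; pos + 6 <= bits.length(); pos += 6) {
            value = (value << 5) | Long.parseLong(bits.substring(pos, pos + 5), 2);
            if (bits.charAt(pos + 5) == '0') {
                break;                        // last chunk of this number
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(1));
        System.out.println(encode(2));
        System.out.println(decode(encode(99_999_999L)));
    }
}
```

    Small differences take a single 6-bit chunk, and each further factor of 2^5 costs one more chunk, which is where the terms of the bit count above come from.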

  • 2020-12-22 14:43

    We have 1 MB - 3 KB RAM = 2^23 - 3*2^13 bits = 8388608 - 24576 = 8364032 bits available.

    We are given 10^6 numbers in a 10^8 range. This gives an average gap of ~100 < 2^7 = 128

    Let's first consider the simpler problem of fairly evenly spaced numbers when all gaps are < 128. This is easy. Just store the first number and the 7-bit gaps:

    (27 bits) + 10^6 7-bit gap numbers = 7000027 bits required

    Note repeated numbers have gaps of 0.

    But what if we have gaps larger than 127?

    OK, let's say a gap size < 127 is represented directly, but a gap size of 127 is followed by a continuous 8-bit encoding for the actual gap length:

     10xxxxxx xxxxxxxx                       = 127 .. 16,383
     110xxxxx xxxxxxxx xxxxxxxx              = 16384 .. 2,097,151
    

    etc.

    Note this number representation describes its own length so we know when the next gap number starts.

    With just small gaps < 127, this still requires 7000027 bits.

    There can be up to (10^8)/(2^7) = 781250 gaps that need the 23-bit form (7 escape bits plus 16 more), requiring an extra 16*781,250 = 12,500,000 bits, which is too much. We need a more compact and more slowly growing representation of gaps.

    The average gap size is 100, so suppose we reorder the gaps as [100, 99, 101, 98, 102, ..., 2, 198, 1, 199, 0, 200, 201, 202, ...] and store each gap's index in that list. The index is written in a dense binary Fibonacci base encoding with no pairs of zeros (for example, 11011=8+5+2+1=16), with numbers delimited by '00'. I think this keeps the gap representation short enough, but it needs more analysis.
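    The reordering around the average gap amounts to a small index mapping, sketched below. This covers only the reordering step, not the Fibonacci-base encoding of the index; the class name `GapReorder` and both method names are hypothetical.

```java
public class GapReorder {
    static final int CENTER = 100; // the average gap size

    // Position of a gap in [100, 99, 101, 98, 102, ..., 0, 200, 201, 202, ...].
    public static int indexOf(int gap) {
        if (gap > 2 * CENTER) {
            return gap;                       // the tail keeps its natural order
        }
        if (gap >= CENTER) {
            return 2 * (gap - CENTER);        // 100 -> 0, 101 -> 2, ..., 200 -> 200
        }
        return 2 * (CENTER - gap) - 1;        // 99 -> 1, 98 -> 3, ..., 0 -> 199
    }

    // Inverse mapping: the gap stored at a given position.
    public static int gapAt(int index) {
        if (index > 2 * CENTER) {
            return index;
        }
        if (index % 2 == 0) {
            return CENTER + index / 2;
        }
        return CENTER - (index + 1) / 2;
    }

    public static void main(String[] args) {
        for (int gap : new int[] {100, 99, 101, 0, 200, 201}) {
            System.out.println(gap + " -> index " + indexOf(gap));
        }
    }
}
```

    Gaps near the average get the smallest indices, so a code that is short for small numbers is short for the most common gaps.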

  • 2020-12-22 14:45

    Here is a generalized solution to this kind of problem:

    General procedure

    The approach taken is as follows. The algorithm operates on a single buffer of 32-bit words. It performs the following procedure in a loop:

    • We start with a buffer filled with compressed data from the last iteration. The buffer looks like this

      |compressed sorted|empty|

    • Calculate the maximum amount of numbers that can be stored in this buffer, both compressed and uncompressed. Split the buffer into these two sections, beginning with the space for compressed data, ending with the uncompressed data. The buffer looks like

      |compressed sorted|empty|empty|

    • Fill the uncompressed section with numbers to be sorted. The buffer looks like

      |compressed sorted|empty|uncompressed unsorted|

    • Sort the new numbers with an in-place sort. The buffer looks like

      |compressed sorted|empty|uncompressed sorted|

    • Right-align any already compressed data from the previous iteration in the compressed section. At this point the buffer is partitioned

      |empty|compressed sorted|uncompressed sorted|

    • Perform a streaming decompression-recompression on the compressed section, merging in the sorted data in the uncompressed section. The old compressed section is consumed as the new compressed section grows. The buffer looks like

      |compressed sorted|empty|

    This procedure is performed until all numbers have been sorted.
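    The merge step of this loop can be sketched as follows. It is a simplified illustration, not this answer's implementation: a plain array of gaps stands in for the arithmetic-coded section, the merge allocates a new array instead of working in place inside one buffer, and the names `DeltaMerge` and `mergeIntoGaps` are made up for the example.

```java
import java.util.Arrays;

public class DeltaMerge {
    // Merge a sorted batch into a sorted sequence stored as gaps, where
    // gaps[0] is the first value and gaps[i] = value[i] - value[i - 1].
    // The old gaps are consumed front to back while the new ones are emitted,
    // mirroring the streaming decompress/recompress pass described above.
    public static int[] mergeIntoGaps(int[] gaps, int[] sortedBatch) {
        int[] out = new int[gaps.length + sortedBatch.length];
        int o = 0;          // write position in the output
        int b = 0;          // read position in the batch
        int prevOut = 0;    // last value written to the output
        int cur = 0;        // running decoded value of the old sequence
        for (int gap : gaps) {
            cur += gap;
            // Emit every batch value that sorts before the current old value.
            while (b < sortedBatch.length && sortedBatch[b] <= cur) {
                out[o++] = sortedBatch[b] - prevOut;
                prevOut = sortedBatch[b++];
            }
            out[o++] = cur - prevOut;
            prevOut = cur;
        }
        while (b < sortedBatch.length) {   // batch values beyond the old maximum
            out[o++] = sortedBatch[b] - prevOut;
            prevOut = sortedBatch[b++];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] gaps = {5, 5, 0};            // encodes 5, 10, 10
        int[] batch = {20, 1, 10};         // newly received, unsorted
        Arrays.sort(batch);                // the in-place sort step
        System.out.println(Arrays.toString(mergeIntoGaps(gaps, batch)));
    }
}
```

    Each pass touches every stored gap once, which is why the answer sizes the uncompressed section to keep the number of passes low.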

    Compression

    This algorithm of course only works when the final compressed size of the new sorting buffer can be calculated before knowing what will actually be compressed. Beyond that, the compression algorithm needs to be good enough to solve the actual problem.

    The approach has three steps. First, the algorithm always stores sorted sequences, so we can store just the differences between consecutive entries. Each difference is in the range [0, 99999999].

    These differences are then encoded as a unary bitstream. A 1 in this stream means "Add 1 to the accumulator", and a 0 means "Emit the accumulator as an entry, and reset". So difference N will be represented by N 1's and one 0.
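    The unary stream is easy to see on a tiny input. The sketch below is an illustration only: it keeps the bits in a String for readability, and the class name `UnaryGaps` and its methods are hypothetical.

```java
public class UnaryGaps {
    // Encode a sorted array: each difference N becomes N '1' bits and one '0'.
    public static String encode(int[] sorted) {
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        for (int v : sorted) {
            for (int i = prev; i < v; i++) {
                sb.append('1');           // "add 1 to the accumulator"
            }
            sb.append('0');               // "emit the entry and reset"
            prev = v;
        }
        return sb.toString();
    }

    // Decode the stream back into the sorted values.
    public static int[] decode(String bits) {
        int count = 0;
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '0') {
                count++;                  // one entry per '0'
            }
        }
        int[] out = new int[count];
        int value = 0;                    // last emitted entry
        int diff = 0;                     // accumulator for the current difference
        int o = 0;
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                diff++;
            } else {
                value += diff;
                out[o++] = value;
                diff = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[] {0, 2, 2, 5}));
    }
}
```

    For a full input the raw stream would hold about 10^8 ones and 10^6 zeros, far more than 1 MB, which is why it is only ever held in arithmetic-coded form.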

    The sum of all differences will approach the maximum value that the algorithm supports, and the count of all differences will approach the amount of values inserted in the algorithm. This means we expect the stream to, at the end, contain max value 1's and count 0's. This allows us to calculate the expected probability of a 0 and 1 in the stream. Namely, the probability of a 0 is count/(count+maxval) and the probability of a 1 is maxval/(count+maxval).

    We use these probabilities to define an arithmetic coding model over this bitstream. This arithmetic code will encode exactly these amounts of 1's and 0's in optimal space. Writing amount for the count of values, each 0 costs log2(1 + maxval / amount) bits and each 1 costs log2(1 + amount / maxval) bits, so the space used by this model for any intermediate bitstream is: bits = encoded * log2(1 + maxval / amount) + maxval * log2(1 + amount / maxval). To calculate the total required space for the algorithm, set encoded equal to amount.
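    The space estimate can be evaluated directly from the two probabilities given above. In this sketch a 0 is charged -log2 of its probability, log2(1 + maxval / amount), and a 1 is charged log2(1 + amount / maxval), where amount is the count of values. The class and method names are invented for the example, and a real coder needs some extra room for fixed-point rounding.

```java
public class CodingBudget {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Ideal size in bits of a stream holding `encoded` zeros and `maxval` ones
    // under a model tuned for `amount` zeros and `maxval` ones in total.
    public static double bits(double encoded, double amount, double maxval) {
        double costOfZero = log2(1.0 + maxval / amount); // -log2(P(0))
        double costOfOne = log2(1.0 + amount / maxval);  // -log2(P(1))
        return encoded * costOfZero + maxval * costOfOne;
    }

    public static void main(String[] args) {
        double amount = 1_000_000.0;    // values to sort
        double maxval = 100_000_000.0;  // size of the value range
        double total = bits(amount, amount, maxval);
        System.out.println("bytes for the full stream: " + (long) (total / 8));
    }
}
```

    With encoded set equal to amount this comes out at roughly 1.01 million bytes, in line with the entropy figure the answer quotes under Optimality.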

    To not require a ridiculous amount of iterations, a small overhead can be added to the buffer. This will ensure that the algorithm will at least operate on the amount of numbers that fit in this overhead, as by far the largest time cost of the algorithm is the arithmetic coding compression and decompression each cycle.

    Next to that, some overhead is necessary to store bookkeeping data and to handle slight inaccuracies in the fixed-point approximation of the arithmetic coding algorithm, but in total the algorithm is able to fit in 1MiB of space even with an extra buffer that can contain 8000 numbers, for a total of 1043916 bytes of space.

    Optimality

    Outside of reducing the (small) overhead of the algorithm it should be theoretically impossible to get a smaller result. To just contain the entropy of the final result, 1011717 bytes would be necessary. If we subtract the extra buffer added for efficiency this algorithm used 1011916 bytes to store the final result + overhead.

  • 2020-12-22 14:45

    If we don't know anything about those numbers, we are limited by the following constraints:

    • we need to load all numbers before we can sort them,
    • the set of numbers is not compressible.

    If these assumptions hold, there is no way to carry out your task, as you will need at least 26,575,425 bits of storage (3,321,929 bytes).

    What can you tell us about your data?
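    The storage figure quoted in this answer is the cost of keeping every number at full width, log2(10^8) bits each, with no compression. A short check, with hypothetical names chosen for the example:

```java
public class RawStorage {
    // Bits needed for `count` independent values drawn from `range`
    // possibilities when nothing is compressed: count * log2(range), rounded up.
    public static long rawBits(long count, long range) {
        return (long) Math.ceil(count * (Math.log(range) / Math.log(2.0)));
    }

    public static long bitsToBytes(long bits) {
        return (bits + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        long bits = rawBits(1_000_000L, 100_000_000L);
        System.out.println(bits + " bits, " + bitsToBytes(bits) + " bytes");
    }
}
```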

  • 2020-12-22 14:46

    I think the solution is to borrow a technique from video encoding, namely delta encoding. In digital video, rather than recording the brightness or colour of successive samples as absolute values such as 110 112 115 116, each value is subtracted from the previous one. 110 112 115 116 becomes 110 2 3 1. The values 2 3 1 require fewer bits than the originals.

    So let's say we create a list of the input values as they arrive on the socket. In each element we store not the value but its offset from the one before it. We sort as we go, so the offsets are never negative. But an offset could be as wide as 8 decimal digits, which needs 27 bits and so 4 bytes. We can't afford 4 bytes per element, so we need to pack these. We could use the top bit of each byte as a "continue bit", indicating that the next byte is part of the number and that the lower 7 bits of each byte need to be combined. Zero is valid, for duplicates.

    As the list fills up, the numbers should get closer together, meaning that on average only 1 byte is used to encode the distance to the next value. 7 bits of value and 1 continue bit is convenient, but there may be a sweet spot that requires fewer than 8 bits per chunk.

    Anyway, I did an experiment. Using a random number generator, I can fit a million sorted 8-digit decimal numbers into about 1279000 bytes. The average space between each number is consistently 99...

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Arrays;
    import java.util.Random;

    public class Test {
        public static void main(String[] args) throws IOException {
            // 1 million values
            int[] values = new int[1000000];

            // create random values up to 8 digits long
            Random random = new Random();
            for (int x = 0; x < values.length; x++) {
                values[x] = random.nextInt(100000000);
            }
            Arrays.sort(values);

            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            int av = 0;                        // sum of differences, at most 99999999
            writeCompact(baos, values[0]);     // first value
            for (int x = 1; x < values.length; x++) {
                int v = values[x] - values[x - 1];  // difference
                av += v;
                System.out.println(values[x] + " diff " + v);
                writeCompact(baos, v);
            }

            System.out.println("Average offset " + (av / values.length));
            System.out.println("Fits in " + baos.toByteArray().length);
        }

        // Writes value in 7-bit groups, lowest group first; the top bit of a
        // byte is set when another byte of the same number follows.
        public static void writeCompact(OutputStream os, long value) throws IOException {
            do {
                int b = (int) (value & 0x7f);
                value = (value & 0x7fffffffffffffffL) >> 7;
                os.write(value == 0 ? b : (b | 0x80));
            } while (value != 0);
        }
    }
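    To read the packed list back, the bytes have to be decoded with the matching rule. The sketch below pairs an encoder that follows the same byte layout as `writeCompact` (low 7 bits first, top bit set when more bytes follow) with a decoder. The class name `CompactCodec` and both method names are made up for the example, and it works on arrays rather than streams to stay short.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class CompactCodec {
    // Pack each value into 7-bit groups, lowest group first; the top bit of a
    // byte is set when another byte of the same value follows.
    public static byte[] encodeAll(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long value : values) {
            do {
                int b = (int) (value & 0x7f);
                value >>>= 7;
                out.write(value == 0 ? b : (b | 0x80));
            } while (value != 0);
        }
        return out.toByteArray();
    }

    // Unpack `count` values from the start of `data`.
    public static long[] decodeAll(byte[] data, int count) {
        long[] values = new long[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            long value = 0;
            int shift = 0;
            int b;
            do {
                b = data[pos++] & 0xff;                 // read byte as unsigned
                value |= (long) (b & 0x7f) << shift;    // place the 7 value bits
                shift += 7;
            } while ((b & 0x80) != 0);                  // continue bit set?
            values[i] = value;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] diffs = {110, 2, 3, 1, 0, 99_999_999L};
        byte[] packed = encodeAll(diffs);
        System.out.println(packed.length + " bytes");
        System.out.println(Arrays.toString(decodeAll(packed, diffs.length)));
    }
}
```

    A difference below 128 takes one byte, and the largest possible difference, 99,999,999, takes four.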
    