Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM

Backend · open · 30 answers · 1899 views
栀梦 asked 2020-12-22 14:33

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list back over the TCP connection.

30 answers
  • 2020-12-22 14:55

    I would try a Radix Tree. If you could store the data in a tree, you could then do an in-order traversal to transmit the data.

    I'm not sure you could fit this into 1MB, but I think it's worth a try.
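
    In case it helps to picture it, here is a minimal sketch of that idea (setting the 1 MB question aside): a binary trie keyed on the 27 bits needed for values below 10^8, with a count per leaf so duplicates survive, and an in-order walk that emits the numbers in sorted order.

    public class BitTrie {
        static final int BITS = 27;                  // enough for [0, 10^8)
        static final class Node { Node zero, one; int count; }
        final Node root = new Node();

        void insert(int value) {
            Node n = root;
            for (int i = BITS - 1; i >= 0; i--) {
                boolean bit = ((value >> i) & 1) != 0;
                Node next = bit ? n.one : n.zero;
                if (next == null) {
                    next = new Node();
                    if (bit) n.one = next; else n.zero = next;
                }
                n = next;
            }
            n.count++;                               // duplicates just bump the count
        }

        void emitSorted(Node n, int prefix, int depth) {
            if (n == null) return;
            if (depth == BITS) {                     // leaf: emit the value count times
                for (int k = 0; k < n.count; k++) System.out.println(prefix);
                return;
            }
            emitSorted(n.zero, prefix << 1, depth + 1);        // smaller half first
            emitSorted(n.one, (prefix << 1) | 1, depth + 1);
        }

        public static void main(String[] args) {
            BitTrie t = new BitTrie();
            for (int v : new int[] {42, 7, 42, 99999999}) t.insert(v);
            t.emitSorted(t.root, 0, 0);              // prints 7, 42, 42, 99999999
        }
    }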

  • 2020-12-22 14:56

    Please see the first correct answer or the later answer with arithmetic coding. Below you may find some fun, but not a 100% bullet-proof, solution.

    This is quite an interesting task, and here is another solution. I hope somebody will find the result useful (or at least interesting).

    Stage 1: Initial data structure, rough compression approach, basic results

    Let's do some simple math: we have 1 MB (1,048,576 bytes) of RAM to store 10^6 eight-digit decimal numbers, i.e. values in [0; 99999999]. Storing one number takes 27 bits (assuming unsigned numbers are used), so a raw stream would need ~3.4 MB of RAM. Somebody already said that doesn't seem feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with a compression factor of about 0.31 or lower and do the sorting in a proper manner.
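
    A quick throwaway snippet to check that arithmetic:

    public class SizeCheck {
        public static void main(String[] args) {
            int bits = (int) Math.ceil(Math.log(100_000_000) / Math.log(2)); // 27
            double rawBytes = 1_000_000L * bits / 8.0;                       // 3,375,000 bytes
            System.out.printf("bits per number: %d, raw size: %.0f bytes, required factor: %.2f%n",
                    bits, rawBytes, (1024 * 1024) / rawBytes);               // factor ~0.31
        }
    }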

    Let's solve the compression issue first. There are some relevant tests already available:

    http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression

    "I ran a test to compress one million consecutive integers using various forms of compression. The results are as follows:"

    None     4000027
    Deflate  2006803
    Filtered 1391833
    BZip2    427067
    Lzma     255040
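
    For reference, the Deflate row is easy to reproduce with the JDK alone (the BZip2 and LZMA rows need external libraries); a rough sketch, not the original benchmark:

    import java.nio.ByteBuffer;
    import java.util.zip.Deflater;

    public class DeflateTest {
        public static void main(String[] args) {
            ByteBuffer raw = ByteBuffer.allocate(4_000_000);     // 10^6 consecutive ints
            for (int i = 0; i < 1_000_000; i++) raw.putInt(i);

            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(raw.array());
            deflater.finish();

            byte[] chunk = new byte[64 * 1024];
            long total = 0;
            while (!deflater.finished()) total += deflater.deflate(chunk);
            System.out.println("deflate: " + total + " bytes"); // on the order of 2 MB
        }
    }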
    

    It looks like LZMA (Lempel–Ziv–Markov chain algorithm) is a good choice to continue with. I've prepared a simple PoC, but there are still some details to be highlighted:

    1. Memory is limited so the idea is to presort numbers and use compressed buckets (dynamic size) as temporary storage
    2. It is easier to achieve a better compression factor with presorted data, so there is a static buffer for each bucket (numbers from the buffer are to be sorted before LZMA)
    3. Each bucket holds a specific range, so the final sort can be done for each bucket separately
    4. Bucket's size can be properly set, so there will be enough memory to decompress stored data and do the final sort for each bucket separately

    In-memory sorting

    Please note that the attached code is a PoC; it can't be used as a final solution, it just demonstrates the idea of using several smaller buffers to store presorted numbers in some optimal way (possibly compressed). LZMA is not proposed as a final solution; it is used here as the fastest way to introduce compression into this PoC.

    See the PoC code below (please note it is just a demo; to compile it, LZMA-Java is needed):

    import java.io.*;
    import java.util.*;
    // Encoder and Decoder come from the LZMA-Java library mentioned above;
    // import them from that library (the exact package name depends on the fork used).

    public class MemorySortDemo {
    
    static final int NUM_COUNT = 1000000;
    static final int NUM_MAX   = 100000000;
    
    static final int BUCKETS      = 5;
    static final int DICT_SIZE    = 16 * 1024; // LZMA dictionary size
    static final int BUCKET_SIZE  = 1024;      // note: unused in this demo
    static final int BUFFER_SIZE  = 10 * 1024;
    static final int BUCKET_RANGE = NUM_MAX / BUCKETS;
    
    static class Producer {
        private Random random = new Random();
        public int produce() { return random.nextInt(NUM_MAX); }
    }
    
    static class Bucket {
        public int size, pointer;
        public int[] buffer = new int[BUFFER_SIZE];
    
        public ByteArrayOutputStream tempOut = new ByteArrayOutputStream();
        public DataOutputStream tempDataOut = new DataOutputStream(tempOut);
        public ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();
    
        public void submitBuffer() throws IOException {
            Arrays.sort(buffer, 0, pointer);
    
            for (int j = 0; j < pointer; j++) {
                tempDataOut.writeInt(buffer[j]);
                size++;
            }            
            pointer = 0;
        }
    
        public void write(int value) throws IOException {
            if (isBufferFull()) {
                submitBuffer();
            }
            buffer[pointer++] = value;
        }
    
        public boolean isBufferFull() {
            return pointer == BUFFER_SIZE;
        }
    
        public byte[] compressData() throws IOException {
            tempDataOut.close();
            return compress(tempOut.toByteArray());
        }        
    
        private byte[] compress(byte[] input) throws IOException {
            final BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(input));
            final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(compressedOut));
    
            final Encoder encoder = new Encoder();
            encoder.setEndMarkerMode(true);
            encoder.setNumFastBytes(0x20);
            encoder.setDictionarySize(DICT_SIZE);
            encoder.setMatchFinder(Encoder.EMatchFinderTypeBT4);
    
            ByteArrayOutputStream encoderProperties = new ByteArrayOutputStream();
            encoder.writeCoderProperties(encoderProperties);
            encoderProperties.flush();
            encoderProperties.close();
    
            encoder.code(in, out, -1, -1, null);
            out.flush();
            out.close();
            in.close();
    
            return encoderProperties.toByteArray(); // the compressed payload itself went to compressedOut
        }
    
        public int[] decompress(byte[] properties) throws IOException {
            InputStream in = new ByteArrayInputStream(compressedOut.toByteArray());
            ByteArrayOutputStream data = new ByteArrayOutputStream(10 * 1024);
            BufferedOutputStream out = new BufferedOutputStream(data);
    
            Decoder decoder = new Decoder();
            decoder.setDecoderProperties(properties);
            decoder.code(in, out, 4 * size);
    
            out.flush();
            out.close();
            in.close();
    
            DataInputStream input = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
            int[] array = new int[size];
            for (int k = 0; k < size; k++) {
                array[k] = input.readInt();
            }
    
            return array;
        }
    }
    
    static class Sorter {
        private Bucket[] bucket = new Bucket[BUCKETS];
    
        public void doSort(Producer p, Consumer c) throws IOException {
    
            for (int i = 0; i < bucket.length; i++) {  // allocate buckets
                bucket[i] = new Bucket();
            }
    
            for(int i=0; i< NUM_COUNT; i++) {         // produce some data
                int value = p.produce();
                int bucketId = value/BUCKET_RANGE;
                bucket[bucketId].write(value);
                c.register(value);
            }
    
            for (int i = 0; i < bucket.length; i++) { // submit non-empty buffers
                bucket[i].submitBuffer();
            }
    
            byte[] compressProperties = null;
            for (int i = 0; i < bucket.length; i++) { // compress the data
                compressProperties = bucket[i].compressData();
            }
    
            printStatistics();
    
            for (int i = 0; i < bucket.length; i++) { // decode & sort buckets one by one
                int[] array = bucket[i].decompress(compressProperties);
                Arrays.sort(array);
    
                for(int v : array) {
                    c.consume(v);
                }
            }
            c.finalCheck();
        }
    
        public void printStatistics() {
            int size = 0;
            int sizeCompressed = 0;
    
            for (int i = 0; i < BUCKETS; i++) {
                int bucketSize = 4*bucket[i].size;
                size += bucketSize;
                sizeCompressed += bucket[i].compressedOut.size();
    
                System.out.println("  bucket[" + i
                        + "] contains: " + bucket[i].size
                        + " numbers, compressed size: " + bucket[i].compressedOut.size()
                        + String.format(" compression factor: %.2f", ((double)bucket[i].compressedOut.size())/bucketSize));
            }
    
            System.out.println(String.format("Data size: %.2fM",(double)size/(1014*1024))
                    + String.format(" compressed %.2fM",(double)sizeCompressed/(1014*1024))
                    + String.format(" compression factor %.2f",(double)sizeCompressed/size));
        }
    }
    
    static class Consumer {
        private Set<Integer> values = new HashSet<>();
    
        int v = -1;
        public void consume(int value) {
            if(v < 0) v = value;
    
            if(v > value) {
                throw new IllegalArgumentException("Current value is greater than previous: " + v + " > " + value);
            }else{
                v = value;
                values.remove(value);
            }
        }
    
        public void register(int value) {
            values.add(value);
        }
    
        public void finalCheck() {
            System.out.println(values.size() > 0 ? "NOT OK: " + values.size() : "OK!");
        }
    }
    
    public static void main(String[] args) throws IOException {
        Producer p = new Producer();
        Consumer c = new Consumer();
        Sorter sorter = new Sorter();
    
        sorter.doSort(p, c);
    }
    }
    

    With random numbers it produces the following:

    bucket[0] contains: 200357 numbers, compressed size: 353679 compression factor: 0.44
    bucket[1] contains: 199465 numbers, compressed size: 352127 compression factor: 0.44
    bucket[2] contains: 199682 numbers, compressed size: 352464 compression factor: 0.44
    bucket[3] contains: 199949 numbers, compressed size: 352947 compression factor: 0.44
    bucket[4] contains: 200547 numbers, compressed size: 353914 compression factor: 0.44
    Data size: 3.81M compressed 1.68M compression factor 0.44
    

    For a simple ascending sequence (one bucket is used) it produces:

    bucket[0] contains: 1000000 numbers, compressed size: 256700 compression factor: 0.06
    Data size: 3.81M compressed 0.24M compression factor 0.06
    

    EDIT

    Conclusion:

    1. Don't try to fool Nature
    2. Use simpler compression with a lower memory footprint
    3. Some additional clues are really needed; a generic bullet-proof solution does not seem feasible

    Stage 2: Enhanced compression, final conclusion

    As was already mentioned in the previous section, any suitable compression technique can be used. So let's get rid of LZMA in favor of a simpler and (if possible) better approach. There are a lot of good options, including arithmetic coding, radix trees, etc.

    Anyway, a simple but useful encoding scheme will be more illustrative than yet another external library providing some nifty algorithm. The actual solution is pretty straightforward: since the buckets hold partially sorted data, deltas can be stored instead of the numbers themselves.

    (Figure: the encoding scheme. Each delta is written as a 2-bit range tag, a 2-bit length field, and then the delta bits themselves; see the sample code below.)

    A random input test shows slightly better results:

    bucket[0] contains: 10103 numbers, compressed size: 13683 compression factor: 0.34
    bucket[1] contains: 9885 numbers, compressed size: 13479 compression factor: 0.34
    ...
    bucket[98] contains: 10026 numbers, compressed size: 13612 compression factor: 0.34
    bucket[99] contains: 10058 numbers, compressed size: 13701 compression factor: 0.34
    Data size: 3.81M compressed 1.30M compression factor 0.34
    

    Sample code

    public static void encode(int[] buffer, int length, BinaryOut output) {
        short size = (short)(length & 0x7FFF);

        output.write(size);
        output.write(buffer[0]);                  // first number is stored as-is

        for (int i = 1; i < size; i++) {
            int next = buffer[i] - buffer[i - 1]; // delta to the previous number
            int bits = getBinarySize(next);

            int len = bits;

            // 2-bit tag selecting the byte range of the delta
            if (bits > 24) {
                output.write(3, 2);
                len = bits - 24;
            } else if (bits > 16) {
                output.write(2, 2);
                len = bits - 16;
            } else if (bits > 8) {
                output.write(1, 2);
                len = bits - 8;
            } else {
                output.write(0, 2);
            }

            if (len > 0) {
                // 2-bit length field; an odd bit count is rounded up to even
                // by emitting one extra leading zero bit before the delta
                if ((len % 2) > 0) {
                    len = len / 2;
                    output.write(len, 2);
                    output.write(false);
                } else {
                    len = len / 2 - 1;
                    output.write(len, 2);
                }

                output.write(next, bits);
            }
        }
    }
    
    public static short decode(BinaryIn input, int[] buffer, int offset) {
        short length = input.readShort();
        int value = input.readInt();              // first number was stored as-is
        buffer[offset] = value;

        for (int i = 1; i < length; i++) {
            int flag = input.readInt(2);          // 2-bit range tag written by encode()

            int bits;
            int next = 0;
            switch (flag) {
                case 0:
                    bits = 2 * input.readInt(2) + 2;
                    next = input.readInt(bits);
                    break;
                case 1:
                    bits = 8 + 2 * input.readInt(2) + 2;
                    next = input.readInt(bits);
                    break;
                case 2:
                    bits = 16 + 2 * input.readInt(2) + 2;
                    next = input.readInt(bits);
                    break;
                case 3:
                    bits = 24 + 2 * input.readInt(2) + 2;
                    next = input.readInt(bits);
                    break;
            }

            buffer[offset + i] = buffer[offset + i - 1] + next; // undo the delta
        }

        return length;
    }
    

    Please note that this approach:

    1. does not consume a lot of memory
    2. works with streams
    3. provides reasonably good results

    Full code can be found here; BinaryInput and BinaryOutput implementations can be found here.

    Final conclusion

    No final conclusion :) Sometimes it is a really good idea to move one level up and review the task from a meta-level point of view.

    It was fun to spend some time with this task. BTW, there are a lot of interesting answers below. Thank you for your attention and happy coding.

  • 2020-12-22 14:56

    The trick is to represent the algorithm's state, which is an integer multi-set, as a compressed stream of "increment counter"="+" and "output counter"="!" characters. For example, the set {0,3,3,4} would be represented as "!+++!!+!", followed by any number of "+" characters. To modify the multi-set you stream out the characters, keeping only a constant amount decompressed at a time, and make changes in place before streaming them back in compressed form.

    Details

    We know there are exactly 10^6 numbers in the final set, so there are at most 10^6 "!" characters. We also know that our range has size 10^8, meaning there are at most 10^8 "+" characters. The number of ways we can arrange 10^6 "!"s amongst 10^8 "+"s is (10^8 + 10^6) choose 10^6, so specifying some particular arrangement takes ~0.965 MiB of data. That'll be a tight fit.

    We can treat each character as independent without exceeding our quota. There are exactly 100 times more "+" characters than "!" characters, which simplifies to 100:1 odds of each character being a "+" if we forget that they are dependent. Odds of 100:1 correspond to ~0.08 bits per character, for an almost identical total of ~0.965 MiB (ignoring the dependency costs only ~12 bits in this case!).
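
    Those numbers can be double-checked in a few lines:

    public class EntropyCheck {
        public static void main(String[] args) {
            double p = 100.0 / 101.0, q = 1.0 / 101.0;     // P("+") and P("!")
            double h = -(p * log2(p) + q * log2(q));       // ~0.0801 bits per character
            double totalChars = 1e8 + 1e6;
            System.out.printf("%.4f bits/char, %.3f MiB total%n",
                    h, h * totalChars / 8 / (1 << 20));    // ~0.965 MiB
        }
        static double log2(double x) { return Math.log(x) / Math.log(2); }
    }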

    The simplest technique for storing independent characters with a known prior probability is Huffman coding. Note that we need an impractically large tree (a Huffman tree for blocks of 10 characters has an average cost per block of about 2.4 bits, for a total of ~2.9 MiB; a Huffman tree for blocks of 20 characters has an average cost per block of about 3 bits, for a total of ~1.8 MiB; we're probably going to need a block size on the order of a hundred, implying more nodes in our tree than all the computer equipment that has ever existed can store). However, ROM is technically "free" according to the problem, and practical solutions that take advantage of the regularity in the tree will look essentially the same.

    Pseudo-code

    • Have a sufficiently large huffman tree (or similar block-by-block compression data) stored in ROM
    • Start with a compressed string of 10^8 "+" characters.
    • To insert the number N, stream out the compressed string until N "+" characters have gone past, then insert a "!". Stream the recompressed string back over the previous one as you go, keeping a constant number of buffered blocks to avoid over/under-runs.
    • Repeat one million times: [input, stream decompress > insert > compress], then decompress to output (a toy illustration of the insert step follows below)
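
    A toy illustration of the insert step, operating on an uncompressed "+"/"!" string (the real solution never holds the whole string in memory; it streams compressed blocks, so this only shows the state change):

    public class PlusBangStream {
        // Insert value n: copy characters until n '+' have gone past, then add '!'.
        static String insert(String state, int n) {
            StringBuilder out = new StringBuilder(state.length() + 1);
            int plusSeen = 0, i = 0;
            for (; i < state.length() && plusSeen < n; i++) {
                if (state.charAt(i) == '+') plusSeen++;
                out.append(state.charAt(i));
            }
            return out.append('!').append(state, i, state.length()).toString();
        }

        public static void main(String[] args) {
            String state = "++++";                        // empty multi-set over [0, 4]
            for (int v : new int[] {0, 3, 3, 4}) state = insert(state, v);
            System.out.println(state);                    // !+++!!+!  (as in the example)

            int counter = 0;                              // replay: '+' increments, '!' outputs
            for (char c : state.toCharArray())
                if (c == '+') counter++; else System.out.println(counter);  // 0 3 3 4
        }
    }
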
  • 2020-12-22 15:02

    There is one rather sneaky trick not mentioned here so far. We assume that you have no extra way to store data, but that is not strictly true.

    One way around your problem is to do the following horrible thing, which should not be attempted by anyone under any circumstances: Use the network traffic to store data. And no, I don't mean NAS.

    You can sort the numbers with only a few bytes of RAM in the following way:

    • First take two variables: COUNTER and VALUE.
    • Set both to 0.
    • Every time you receive an integer I, increment COUNTER and set VALUE to max(VALUE, I);
    • Then send an ICMP echo request packet with data set to I to the router. Erase I and repeat.
    • Every time you receive the returned ICMP packet, you simply extract the integer and send it back out again in another echo request. This produces a huge number of ICMP requests scuttling backward and forward containing the integers.

    Once COUNTER reaches 1000000, you have all of the values stored in the incessant stream of ICMP requests, and VALUE now contains the maximum integer. Pick some threshold T >> 1000000. Set COUNTER to zero. Every time you receive an ICMP packet, increment COUNTER and send the contained integer I back out in another echo request, unless I=VALUE, in which case transmit it to the destination for the sorted integers. Once COUNTER=T, decrement VALUE by 1, reset COUNTER to zero and repeat. Once VALUE reaches zero you should have transmitted all integers in order from largest to smallest to the destination, and have only used about 47 bits of RAM for the two persistent variables (and whatever small amount you need for the temporary values).
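
    For anyone who wants to see it run, here is a scaled-down simulation with a Queue standing in for the cloud of in-flight echo requests (which of course defeats the whole point; the real version keeps the data on the wire, not in RAM):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.Random;

    public class PingSortSimulation {
        public static void main(String[] args) {
            Queue<Integer> wire = new ArrayDeque<>();     // stand-in for the network
            Random in = new Random(1);
            long counter = 0;
            int value = 0;                                // running maximum
            for (int n = 0; n < 1000; n++) {              // scaled down from 10^6
                int i = in.nextInt(10_000);
                counter++;
                value = Math.max(value, i);
                wire.add(i);                              // "send echo request"
            }
            final long T = 2_000;                         // threshold >> queue size
            counter = 0;
            while (value >= 0 && !wire.isEmpty()) {
                int i = wire.poll();                      // "returned echo"
                counter++;
                if (i == value) System.out.println(i);    // emit next-largest value
                else wire.add(i);                         // bounce it back out
                if (counter == T) { value--; counter = 0; }
            }
        }
    }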

    I know this is horrible, and I know there can be all sorts of practical issues, but I thought it might give some of you a laugh or at least horrify you.

  • 2020-12-22 15:02

    If the input stream could be received a few times, this would be much easier (there is no information about that, so this is only an idea, and it raises a time-performance problem).

    We could then count the values: with the counts it would be easy to produce the sorted output stream. Compress by counting the values; how well this works depends on what is in the input stream.
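
    A sketch of what multi-pass counting could look like under that assumption; the replayable stream is simulated by a seeded generator, and each pass only counts one slice of the value range so the counters fit in 1 MB:

    import java.util.Random;

    public class MultiPassCountingSort {
        static final int MAX = 100_000_000;       // values are in [0, MAX)
        static final int N = 1_000_000;           // numbers in the stream
        static final int SLICE = 250_000;         // 250k int counters ~= 1 MB

        // Stand-in for re-receiving the stream: a seeded generator
        // replays the same sequence on every call.
        static java.util.stream.IntStream replay() {
            return new Random(42).ints(N, 0, MAX);
        }

        public static void main(String[] args) {
            long emitted = 0;
            for (int lo = 0; lo < MAX; lo += SLICE) {       // 400 passes over the stream
                final int base = lo, hi = Math.min(lo + SLICE, MAX);
                int[] count = new int[hi - base];           // fits in ~1 MB
                replay().filter(v -> v >= base && v < hi)
                        .forEach(v -> count[v - base]++);
                for (int v = base; v < hi; v++)             // emit this slice in order
                    emitted += count[v - base];             // send(v) count times here
            }
            System.out.println("emitted " + emitted + " numbers in sorted order");
        }
    }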

  • 2020-12-22 15:03

    A radix tree representation would come close to handling this problem, since the radix tree takes advantage of "prefix compression". But it's hard to conceive of a radix tree representation that could represent a single node in one byte -- two is probably about the limit.

    But, regardless of how the data is represented, once it is sorted it can be stored in prefix-compressed form, where the numbers 10, 11, and 12 would be represented by, say, 001b, 001b, 001b, each indicating an increment of 1 from the previous number. Perhaps, then, 10101b would represent an increment of 5, 1101001b an increment of 9, etc.
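
    Here is a sketch along those lines using Elias gamma codes for the gaps (a standard self-delimiting code, not necessarily the exact scheme imagined above). Each gap is stored as delta + 1, since gamma cannot encode zero:

    import java.util.BitSet;

    public class GammaDeltas {
        // Append the Elias gamma code of x (x >= 1) at bit position pos[0]:
        // floor(log2 x) zeros, then the binary digits of x, MSB first.
        static void gamma(BitSet bits, int[] pos, int x) {
            int n = 31 - Integer.numberOfLeadingZeros(x);  // floor(log2 x)
            pos[0] += n;                                   // n leading zeros (implicit)
            for (int i = n; i >= 0; i--)
                if (((x >> i) & 1) != 0) bits.set(pos[0] + (n - i));
            pos[0] += n + 1;
        }

        public static void main(String[] args) {
            int[] sorted = {10, 11, 12, 17, 26};
            BitSet bits = new BitSet();
            int[] pos = {0};
            gamma(bits, pos, sorted[0] + 1);               // first value, shifted by 1
            for (int i = 1; i < sorted.length; i++)
                gamma(bits, pos, sorted[i] - sorted[i - 1] + 1);
            System.out.println(pos[0] + " bits for " + sorted.length + " numbers"); // 25 bits
        }
    }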
