How to sort millions of rows of data in a file with very limited memory

不思量自难忘° 2021-02-01 06:44

(From here)

I attended an interview last week and this question was asked:

How do you sort a billion rows of data in a file with only 640KB of memory on an 8080-based machine?

9 Answers
  • 2021-02-01 07:26

    If speed is not a requirement, you could bubble sort rows in place in the file. This only requires looking at two rows of data at a time, with no external information or storage required.
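
    A minimal sketch of this idea in Python, assuming fixed-width records that compare correctly as raw bytes (both assumptions); only two records are ever held in memory:

    ```python
    import os

    RECORD_SIZE = 64  # assumption: fixed-width records

    def file_bubble_sort(path: str) -> None:
        """Bubble sort the file in place, two records in memory at a time."""
        n = os.path.getsize(path) // RECORD_SIZE
        with open(path, "r+b") as f:
            for end in range(n - 1, 0, -1):
                swapped = False
                for i in range(end):
                    f.seek(i * RECORD_SIZE)
                    a = f.read(RECORD_SIZE)
                    b = f.read(RECORD_SIZE)
                    if a > b:  # raw-byte comparison (assumption)
                        f.seek(i * RECORD_SIZE)
                        f.write(b + a)  # swap the pair on disk
                        swapped = True
                if not swapped:  # no swaps in a pass: already sorted
                    break
    ```

    On a billion rows this is O(n²) seek-heavy passes over the disk, hence the "if speed is not a requirement" caveat.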

  • 2021-02-01 07:28

    Knuth has a whole section on external sorting; it was commonplace back when there were no hard drives, memory was scarce, and tape drives were the norm. See the Wikipedia page on external sorting, and/or Vol. 3 of Knuth's The Art of Computer Programming.
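
    For reference, a minimal external merge sort sketch in Python: sort chunks that fit in memory into runs, then stream a k-way merge. Note this assumes scratch storage for the runs, which the interview constraint may rule out; CHUNK_LINES is an arbitrary stand-in for "whatever fits in the memory budget":

    ```python
    import heapq
    import tempfile

    CHUNK_LINES = 10_000  # assumption: one chunk fits in available memory

    def external_merge_sort(src_path: str, dst_path: str) -> None:
        """Phase 1: write sorted runs to scratch files. Phase 2: k-way merge."""
        runs = []
        with open(src_path) as src:
            while True:
                chunk = [line for _, line in zip(range(CHUNK_LINES), src)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile("w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(dst_path, "w") as dst:
            # heapq.merge streams the runs: one buffered line per run in memory
            dst.writelines(heapq.merge(*runs))
    ```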

    I agree with Robusto's comment:

    Where do you get the file from if you can't use the drive? It's certainly not going to be held in memory.

    Not enough problem definition.

  • 2021-02-01 07:28

    You can find a discussion of a similar problem in Jon Bentley's Programming Pearls, Column 1. There Bentley deals with sorting a file of seven-digit telephone numbers, guaranteed to be distinct, using a bitset data structure.
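
    A minimal sketch of that bitmap idea, assuming the rows are distinct non-negative integers below a known bound:

    ```python
    MAX_NUMBER = 10_000_000  # Bentley's case: distinct 7-digit numbers

    def bitmap_sort(src_path: str, dst_path: str) -> None:
        """One bit per possible value; a set bit encodes the number's presence."""
        bits = bytearray(MAX_NUMBER // 8 + 1)  # ~1.25 MB for 10 million values
        with open(src_path) as src:
            for line in src:
                n = int(line)
                bits[n >> 3] |= 1 << (n & 7)   # set bit n
        with open(dst_path, "w") as dst:
            for n in range(MAX_NUMBER):
                if bits[n >> 3] & (1 << (n & 7)):
                    dst.write(f"{n}\n")        # scanning bits in order yields sorted output
    ```

    A full 10-million-value bitmap is about 1.25 MB, over a 640 KB budget, so Bentley's answer is to make multiple passes, each pass covering one subrange of the values.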

  • 2021-02-01 07:31

    Obviously you have to be able to read and write to the billion-row file. The constraint of no external disk means you must restrict yourself to in-place algorithms, or make some assumptions about the starting conditions and distribution of the data so that you can keep it sorted as it is added to the file (e.g. use the key as the index and create a file large enough to hold the expected number of keys, as sketched below).

    If you must start with an unsorted file and sort it, you can use an in-place merge sort operating on very small chunks of the file. Since no constraints are placed on the access times of the storage medium, it may be quite fast.
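
    A minimal sketch of the key-as-index idea, assuming fixed-width records whose key is a unique integer in a known range stored in the record's first four bytes (all hypothetical details):

    ```python
    import struct

    RECORD_SIZE = 16        # assumption: fixed-width records
    KEY_RANGE = 1_000_000   # assumption: unique integer keys in [0, KEY_RANGE)

    def place_by_key(src_path: str, dst_path: str) -> None:
        """Write each record at offset key * RECORD_SIZE, so the
        destination file is sorted by construction as records arrive."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            dst.truncate(KEY_RANGE * RECORD_SIZE)  # pre-size the file
            while True:
                rec = src.read(RECORD_SIZE)
                if not rec:
                    break
                key, = struct.unpack_from("<I", rec)  # key in first 4 bytes
                dst.seek(key * RECORD_SIZE)
                dst.write(rec)
    ```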

  • 2021-02-01 07:31

    I'd use the GPU! Even on a fast computer, the GPU is often faster at sorting. And I don't know how big the "rows" are, but it's not hard to find 1GB video cards, so that answers the storage question, too.

    Besides, if I had to work on an 8080, I'd definitely want to put the sweetest graphics card I could find on there.

    You just have to be ready for the follow-up question: "How do you get an 8080 to talk to a modern PCI Express 2.0 x16 card?". I have discovered a truly marvelous method, but this textarea is too narrow to contain it.

  • 2021-02-01 07:36

    Another question to have asked is, "What is the nature of the rows?" If the number of distinct values is low enough, then the answer might be a pigeonhole sort.

    For example, say the file to be sorted only contained rows that held a number between 0 and 100 inclusive. Create an array of 101 unsigned 32-bit or 64-bit integers, initialized to 0. As you read a row, use its value to index the array and increment that element's count. Once the file has been read, start at 0, emit as many 0s as were counted, move on to 1, and repeat (see the sketch below). Expand the array as needed to handle the range of numbers coming through. Of course there are limits: say the values span from -2e9 to +2e9. That's 4e9 bins, which is not going to fit in 640K of RAM.

    If instead the rows are strings, but you are still looking at a small enough set of distinct values, then use an associative array or hash table to hold the counts.
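
    A minimal sketch of the counting pass and the rewrite pass, assuming one small integer per row:

    ```python
    from array import array

    MAX_VALUE = 100  # assumption: each row holds an integer in [0, MAX_VALUE]

    def pigeonhole_sort_file(src_path: str, dst_path: str) -> None:
        """One pass to tally counts, one pass to emit values in order.
        Memory use is just the counts array, independent of file size."""
        counts = array("Q", [0] * (MAX_VALUE + 1))  # 64-bit counters
        with open(src_path) as src:
            for line in src:
                counts[int(line)] += 1
        with open(dst_path, "w") as dst:
            for value, n in enumerate(counts):
                for _ in range(n):
                    dst.write(f"{value}\n")
    ```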
