Question
Hi, I saw this as an interview question and thought it was interesting, but I'm not sure about the answer.
What would be the best way?
Answer 1:
Assuming *nix:
system("sort <input_file >output_file")
"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See @psyho's answer for a good summary of an external sorting algorithm.
Answer 2:
Put them in a database and let the database worry about it.
Answer 3:
One way to do this is to use an external sorting algorithm:
- Read a chunk of file into memory
- Sort that chunk using any regular sorting algorithm (like quicksort)
- Output the sorted strings into a temporary file
- Repeat steps 1-3 until you process the whole file
- Apply the merge-sort algorithm by reading the temporary files line by line
- Profit!
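A minimal Python sketch of those steps, assuming each string is one newline-terminated line; the chunk size, temp-file handling, and the use of heapq.merge for the final k-way merge are illustrative choices, not part of the answer:

```python
import heapq
import os
import tempfile

CHUNK_LINES = 1_000_000  # tune so one chunk of strings fits comfortably in RAM

def external_sort(input_path, output_path):
    temp_paths = []
    # Steps 1-4: read chunks, sort each one in memory, spill sorted runs to disk.
    with open(input_path) as src:
        while True:
            chunk = [line for _, line in zip(range(CHUNK_LINES), src)]
            if not chunk:
                break
            chunk.sort()  # any in-memory sort works here (Python's is Timsort)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            temp_paths.append(path)

    # Step 5: k-way merge of the sorted runs, reading each file line by line.
    # Assumes every line (including the last) ends with '\n'.
    runs = [open(p) for p in temp_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in temp_paths:
            os.remove(p)
```

Only one chunk plus the merge buffers needs to be resident at a time, which is why this stays within the 4GB limit.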
Answer 4:
Well, this is an interesting interview question... almost all questions of this kind are meant to test your skills and, fortunately, don't directly apply to real-life examples. This looks like one of those, so let's get into the puzzle.
When your interviewer asks for "best", I believe he/she is talking about performance only.
Answer 1
30GB of strings is a lot of data. All comparison-based (compare-and-swap) sorting algorithms are Ω(n log n), so it will take a long time. There are O(n) algorithms, such as counting sort, but they are not in-place, so you would be multiplying the 30GB while you have only 4GB of RAM (consider how much swapping that implies...), so I would go with quicksort.
Answer 2 (partial)
Start thinking about counting sort. You may want to first split the strings into groups (using a radix-sort approach), one for each letter. You may want to scan the file and, for each initial letter, move the string (so copy and delete, no space wasted) into a temporary file. You may want to repeat the process for the first 2, 3, or 4 characters of each string. Then, in order to reduce the complexity of sorting lots of files, you can sort the strings within each file separately (using quicksort now) and finally concatenate all the files in order. This way you still have O(n log n), but on a far lower n.
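A hedged sketch of that bucketing idea in Python, partitioning by the first character only (the prefix length, file naming, and in-memory sort per bucket are illustrative assumptions):

```python
import os

PREFIX_LEN = 1  # bucket by the first character; longer prefixes give smaller buckets

def bucket_sort_file(input_path, output_path, tmp_dir="buckets"):
    os.makedirs(tmp_dir, exist_ok=True)
    handles = {}
    # Pass 1: scatter each line into a bucket file keyed by its leading prefix.
    with open(input_path) as src:
        for line in src:
            key = line[:PREFIX_LEN]
            if key not in handles:
                fname = "bucket_" + "_".join(str(ord(c)) for c in key)
                handles[key] = open(os.path.join(tmp_dir, fname), "w")
            handles[key].write(line)
    for f in handles.values():
        f.close()

    # Pass 2: sort each bucket independently in memory and append in key order.
    # Because the buckets partition the data by prefix, concatenating them in
    # sorted key order yields a fully sorted output.
    with open(output_path, "w") as out:
        for key in sorted(handles):
            path = handles[key].name
            with open(path) as bucket:
                lines = bucket.readlines()
            lines.sort()
            out.writelines(lines)
            os.remove(path)
```

Each bucket must still fit in memory for its quicksort-style pass, which is why the answer suggests extending the prefix to 2-4 characters if single-letter buckets are too large.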
Answer 5:
Database systems already handle this particular problem well.
A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.
Source: https://stackoverflow.com/questions/4714043/what-is-the-best-way-to-sort-30gb-of-strings-with-a-computer-with-4gb-of-ram-usi