What is the best way to sort 30gb of strings with a computer with 4gb of RAM using Ruby as scripting language?

点点圈 submitted on 2019-12-23 09:07:42

Question


Hi, I saw this as an interview question and thought it was interesting, but I am not sure about the answer.

What would be the best way?


Answer 1:


Assuming *nix:

system("sort <input_file >output_file")

"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
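As a rough sketch of what tuning those switches might look like from Ruby, assuming GNU coreutils `sort` is installed: `-S` caps the in-memory buffer, `-T` chooses where temporary files are written, and `-o` names the output file. The helper names and the default values (`2G`, `/tmp`) are made up for illustration. GNU sort also has a `--batch-size` option that limits how many temporary files are merged at once.

```ruby
# Build the argument list for GNU sort.
# buffer_size and tmp_dir are illustrative defaults, not values from the answer.
def gnu_sort_command(input, output, buffer_size: "2G", tmp_dir: "/tmp")
  ["sort",
   "-S", buffer_size, # --buffer-size: cap on the main-memory buffer
   "-T", tmp_dir,     # --temporary-directory: where spill files are written
   "-o", output,      # write the result here instead of stdout
   input]
end

# Run the command. LC_ALL=C asks for plain byte-order comparison,
# which is locale independent.
# Returns true on success, false on a non-zero exit, nil if sort could not be run.
def sort_with_gnu_sort(input, output, **opts)
  system({ "LC_ALL" => "C" }, *gnu_sort_command(input, output, **opts))
end
```

Passing the arguments as separate strings avoids going through a shell, so file names with spaces need no quoting.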

If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See @psyho's answer for a good summary of an external sorting algorithm.




Answer 2:


Put them in a database and let the database worry about it.




Answer 3:


One way to do this is to use an external sorting algorithm:

  1. Read a chunk of file into memory
  2. Sort that chunk using any regular sorting algorithm (like quicksort)
  3. Output the sorted strings into a temporary file
  4. Repeat steps 1-3 until you process the whole file
  5. Merge the sorted temporary files (the merge step of merge sort), reading each one line by line
  6. Profit!
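
The steps above could be sketched in Ruby roughly as follows. This is a minimal illustration, not a tuned implementation: the method names and the `max_run_bytes` parameter are invented here, lines are compared byte-wise, and the merge picks the smallest head with a linear scan, which is acceptable for a few dozen runs but would want a heap for many more. Ruby strings carry per-object overhead, so the run size should be set well below the available RAM.

```ruby
require "tempfile"

# Steps 1-4: read chunks of roughly max_run_bytes, sort each in memory,
# and write each one to its own temporary "run" file.
def write_sorted_runs(input_path, max_run_bytes)
  runs = []
  File.open(input_path, "rb") do |input|
    until input.eof?
      chunk = []
      bytes = 0
      while bytes < max_run_bytes && (line = input.gets)
        line = line.delete_suffix("\n")
        chunk << line
        bytes += line.bytesize + 1
      end
      run = Tempfile.new("run")
      run.binmode
      chunk.sort!.each { |l| run << l << "\n" }
      run.rewind
      runs << run
    end
  end
  runs
end

# Step 5: k-way merge. Keep one current line per run and repeatedly
# emit the smallest one, then refill from the run it came from.
def merge_runs(runs, output_path)
  heads = runs.map { |r| r.gets&.delete_suffix("\n") }
  File.open(output_path, "wb") do |out|
    loop do
      idx = nil
      heads.each_with_index do |line, i|
        next if line.nil? # this run is exhausted
        idx = i if idx.nil? || line < heads[idx]
      end
      break if idx.nil? # every run is exhausted
      out << heads[idx] << "\n"
      heads[idx] = runs[idx].gets&.delete_suffix("\n")
    end
  end
ensure
  runs.each(&:close!) # close and delete the temporary files
end

def external_sort(input_path, output_path, max_run_bytes: 512 * 1024 * 1024)
  merge_runs(write_sorted_runs(input_path, max_run_bytes), output_path)
end
```

Only one chunk is held in memory during the first phase, and only one line per run during the merge.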



Answer 4:


Well, this is an interesting interview question. Almost all questions of this kind are meant to test your skills and, fortunately, don't directly apply to real-life situations. This looks like one of them, so let's get into the puzzle.

When your interviewer asks for the "best" way, I believe he/she is talking about performance only.

Answer 1

30GB of strings is a lot of data. All comparison-based sorting algorithms are Omega(n log n), so it will take a long time. There are O(n) algorithms, such as counting sort, but they are not in place, so you would be multiplying the 30GB while having only 4GB of RAM (consider the amount of swapping...). So I would go with quicksort.

Answer 2 (partial)

Start thinking about counting sort. You may want to first split the strings into groups (using a radix sort approach), one for each initial letter. Scan the file and, for each initial letter, move the string (copy and delete, so no space is wasted) into a temporary file for that letter. You may want to repeat the process for the first 2, 3 or 4 characters of each string. Then, to reduce the cost of sorting so much data at once, sort the strings within each file separately (using quicksort now) and finally merge all files in order. This way you still have O(n log n), but on a far lower n.
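
A hedged Ruby sketch of this bucketing idea is below. It is illustrative only: the method name, the `prefix_bytes` parameter, and the bucket file naming are invented here, and it assumes that each bucket fits in memory and that the number of distinct prefixes stays under the open-file limit. Because the buckets are keyed by leading bytes, the final "merge" reduces to concatenating the sorted buckets in prefix order.

```ruby
require "tmpdir"

# Partition lines into one file per leading prefix, sort each bucket in
# memory, then concatenate the buckets in prefix order (byte-wise ordering).
def bucket_sort_file(input_path, output_path, prefix_bytes: 1)
  Dir.mktmpdir("buckets") do |dir|
    # Hex-encode the prefix so any byte value gives a safe file name.
    path_for = ->(key) { File.join(dir, "b_" + key.unpack1("H*")) }
    handles = {} # prefix => open bucket file

    File.open(input_path, "rb") do |input|
      input.each_line do |line|
        line = line.delete_suffix("\n")
        key = line.byteslice(0, prefix_bytes)
        handles[key] ||= File.open(path_for.call(key), "wb")
        handles[key] << line << "\n"
      end
    end
    handles.each_value(&:close)

    File.open(output_path, "wb") do |out|
      handles.keys.sort.each do |key|
        # Assumption: a single bucket is small enough to sort in RAM.
        lines = File.readlines(path_for.call(key), mode: "rb")
        lines.map! { |l| l.delete_suffix("\n") }
        lines.sort!.each { |l| out << l << "\n" }
      end
    end
  end
end
```

If one prefix dominates the data, its bucket may still be too large for memory; a longer prefix or an external merge sort on that bucket would be needed.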




Answer 5:


Database systems are already handling this particular problem well.

A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.



Source: https://stackoverflow.com/questions/4714043/what-is-the-best-way-to-sort-30gb-of-strings-with-a-computer-with-4gb-of-ram-usi
