This is a Google interview question: given 2 machines, each having 64 GB of RAM and together holding all of the integers (8 bytes each), sort the entire 128 GB of data. You may assume a small amount of additional RAM.
There are already answers for the 2 machine case.
I'm assuming that the 128 GB of data to be sorted is stored as a single file on a single hard drive (or other external device). No matter how many machines or hard drives are used, the time it takes to read the original 128 GB file and write the sorted 128 GB file remains the same; the only savings occur during the internal RAM-based sorts that create chunks of sorted data. Likewise, the time it takes to use n+1 hard drives to do an n-way merge into a single sorted 128 GB file on the remaining hard drive stays the same, limited by the time it takes to write that 128 GB sorted file onto the remaining drive.
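The overall pattern is a classic external merge sort: sort memory-sized chunks, spill each as a sorted run, then do an n-way merge of the runs. A minimal single-machine sketch (the function name and text-file format are my own choices, not part of the question) could look like:

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size, out_path):
    """Sort an iterable of integers too large for RAM:
    sort fixed-size chunks in memory, spill each as a sorted run
    to a temp file, then n-way merge the runs into out_path."""
    run_paths = []
    chunk = []

    def spill():
        chunk.sort()                       # internal RAM-based sort of one chunk
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            spill()
    if chunk:
        spill()

    # n-way merge of the sorted runs; heapq.merge streams lazily,
    # so memory use stays small regardless of total data size
    runs = [map(int, open(p)) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(f"{v}\n" for v in heapq.merge(*runs))
    for p in run_paths:
        os.remove(p)
```

The real problem replaces the temp files with chunks on the working hard drives, but the chunk-then-merge structure is the same.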
For n machines, the data would be split into 128 GB/n chunks. The machines could alternate reading sub-chunks, perhaps 64 MB at a time, to reduce random-access overhead, so that the "last" machine isn't waiting for all of the prior machines to finish reading their entire chunks before it even starts.
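To make the split concrete, here is a small sketch (function name and return format are my own) that computes each machine's byte range and a round-robin schedule of sub-chunk reads against the shared source drive:

```python
def chunk_plan(total_bytes, n_machines, sub_chunk=64 * 2**20):
    """Return each machine's (offset, length) byte range, plus a
    round-robin schedule of (machine, offset, length) sub-chunk reads
    so machines interleave access to the single source drive."""
    per = total_bytes // n_machines
    ranges = [(i * per, per if i < n_machines - 1 else total_bytes - i * per)
              for i in range(n_machines)]
    schedule = []
    step = 0
    while True:
        issued = False
        for m, (off, length) in enumerate(ranges):
            pos = step * sub_chunk
            if pos < length:                      # this machine still has data left
                schedule.append((m, off + pos, min(sub_chunk, length - pos)))
                issued = True
        if not issued:
            break
        step += 1
    return ranges, schedule
```

With 64 MB sub-chunks, machine 0 reads its first 64 MB, then machine 1 reads its first 64 MB, and so on, so every machine starts sorting early.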
For n machines (64 GB of RAM each) and n+1 hard drives with n >= 4, each machine could use a radix sort, which runs in time linear in the number of elements, to create chunks of 32 GB or smaller on the n working hard drives at the same time, followed by an n-way merge onto the destination hard drive.
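For 8-byte integers the radix sort can proceed one byte per pass, giving a fixed 8 passes regardless of data size. A sketch (handling signed values by biasing them into unsigned order, an assumption since the question doesn't say whether the integers are signed):

```python
def radix_sort_64(keys):
    """LSD radix sort of 64-bit signed integers, one byte per pass
    (8 passes total), so the work is linear in the number of keys."""
    BIAS = 1 << 63                        # maps signed order onto unsigned order
    a = [k + BIAS for k in keys]
    for shift in range(0, 64, 8):
        buckets = [[] for _ in range(256)]
        for k in a:
            buckets[(k >> shift) & 0xFF].append(k)
        # concatenating buckets in order keeps the sort stable,
        # which is what makes LSD radix sort correct
        a = [k for b in buckets for k in b]
    return [k - BIAS for k in a]
```

In practice each machine would run this over its 64 GB in place with counting passes rather than Python lists, but the pass structure is the same.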
There's a point of diminishing returns that limits the benefit of larger n. Somewhere beyond n > 16, the internal merge throughput could exceed disk I/O bandwidth. If the merge process is CPU bound rather than I/O bound, there's a trade-off between the CPU time saved by creating chunks in parallel and the merge's CPU overhead growing beyond the I/O time.
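A back-of-envelope model of that crossover, using purely illustrative numbers (the default disk bandwidth and comparison rate below are my assumptions, not measurements): an n-way heap merge costs about log2(n) comparisons per 8-byte record, so at some n the CPU can no longer keep up with the destination drive.

```python
import math

def merge_is_cpu_bound(n_ways, disk_mb_s=200.0, cmp_per_sec=2e8):
    """Rough model: an n-way heap merge does ~log2(n) comparisons
    per 8-byte record. Returns True once the CPU's record rate drops
    below the rate at which the destination disk can absorb records,
    i.e. the merge has become CPU bound rather than I/O bound."""
    records_per_sec_disk = disk_mb_s * 2**20 / 8   # 8 bytes per record
    records_per_sec_cpu = cmp_per_sec / max(1.0, math.log2(n_ways))
    return records_per_sec_cpu < records_per_sec_disk
```

Under these assumed numbers a small merge fan-in stays I/O bound, while a very wide one flips to CPU bound; the actual crossover depends entirely on the real hardware.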