shuffle a large list of items without loading in memory

后端未结

关注

 6  1854

礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200gigs). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can\'t hold all the dat

6条回答

醉梦人生 (楼主)

2021-01-07 21:22
You may check my HugeFileProcessor tool. It's similar to @Alex-Reynolds's sample, but should be significantly faster as there would be no seeks.

Here are the details on shuffling implementation. It requires specifying batchSize - number of lines to keep in RAM when writing to output. The more is the better (unless you are out of RAM), because total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles whole file, not on per-batch basis.

The algorithm is as follows.
1. Count lines in sourceFile. This is done simply by reading whole file line-by-line. (See some comparisons here.) This also gives a measurement of how much time would it take to read whole file once. So we could estimate how many times it would take to make a complete shuffle because it would require Ceil(linesCount / batchSize) complete file reads.
2. As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This would give us an order in which we want to have lines in a shuffled file. Note that this is a global order over the whole file, not per batch or chunk or something.
3. Now the actual code. We need to get all lines from sourceFile in a order we just computed, but we can't read whole file in memory. So we just split the task.
  - We would go through the sourceFile reading all lines and storing in memory only those lines that would be in first batchSize of the orderArray. When we get all these lines, we could write them into outFile in required order, and it's a batchSize/linesCount of work done.
  - Next we would repeat whole process again and again taking next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
Why it works?

Because all we do is just reading the source file from start to end. No seeks forward/backward, and that's what HDDs like. File gets read in chunks according to internal HDD buffers, FS blocks, CPU cahce, etc. and everything is being read sequentially.

Some numbers

On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...