bash - shuffle a file that is too large to fit in memory

梦毁少年i 2021-01-01 20:29

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other).

6 Answers
  • 2021-01-01 20:44

    First of all, I would say this is not a strict global shuffle, only an approximation.

    The idea is to split the large file into smaller pieces and then shuffle those.

    1. Split the large file into pieces:

    split --line-bytes=500M large_file small_file_

    This will split your large_file into small_file_aa, small_file_ab, and so on, without cutting any line in half.

    2. Shuffle each piece:

    shuf small_file_aa > small_file_aa.shuf

    Shuffle every piece this way, then concatenate the shuffled pieces in a random order; repeating the whole split-and-shuffle pass a few times gives a result that approximates a global shuffle. A sketch of one such pass follows.
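    A minimal sketch of one such pass, assuming GNU split and shuf are available and that each roughly 500 MB piece fits in RAM; the file names are illustrative:

    #!/usr/bin/env bash
    set -euo pipefail

    # Split into pieces of at most 500 MB each, without cutting any line in half.
    split --line-bytes=500M large_file small_file_

    # Shuffle each piece independently.
    for piece in small_file_??; do
      shuf "$piece" > "$piece.shuf" && rm "$piece"
    done

    # Concatenate the shuffled pieces in a random order; repeating the whole
    # pass on the result gets closer to a true global shuffle.
    ls small_file_??.shuf | shuf | xargs cat > large_file.shuffled
    rm small_file_??.shuf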

  • 2021-01-01 20:44

    How about:

    perl <large-input-file -lne 'print rand(), "\t", $_' | sort | perl -lpe 's/^.*?\t//' >shuffled-output-file

    This decorates each line with a random key, sorts on it, and strips the key again; sort typically spills to temporary files (an external merge sort), so it can handle inputs larger than RAM.

  • 2021-01-01 20:57

    Count the lines (wc -l), then generate the line numbers 1..N in a random order, perhaps writing them to a temp file in /tmp/ (which is typically RAM-backed and thus relatively fast). Then copy the line corresponding to each number to the target file, in the order of the shuffled numbers.

    This would be time-inefficient, because of the amount of seeking for newlines in the file, but it would work on almost any size of file.
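    A minimal sketch of that idea, assuming the shuffled list of line numbers (but not the lines themselves) fits in memory; the file names are illustrative:

    #!/usr/bin/env bash
    set -euo pipefail

    in=large_file          # input (illustrative name)
    out=shuffled_file      # output (illustrative name)

    # Count the lines; wc -l may pad with spaces on some systems, so let
    # arithmetic expansion strip them.
    n=$(( $(wc -l < "$in") ))

    # Write the numbers 1..N to a temp file in a random order; shuf only
    # has to hold the numbers, not the lines.
    seq "$n" | shuf > /tmp/line_order.txt

    # Copy the line corresponding to each shuffled number to the target
    # file. sed rescans the input for every line, which is what makes this
    # approach time-inefficient but memory-friendly.
    while read -r lineno; do
      sed -n "${lineno}p" "$in" >> "$out"
    done < /tmp/line_order.txt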

  • 2021-01-01 20:57

    Have a look at https://github.com/alexandres/terashuf. From the page:

    terashuf implements a quasi-shuffle algorithm for shuffling multi-terabyte text files using limited memory
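    Assuming the tool is built as described in its README, it appears to act as a plain stdin-to-stdout filter; treat the invocation below as an assumption and check the repository for the exact options and memory settings:

    # Assumed invocation (stdin in, stdout out); verify against the README.
    ./terashuf < large_file > shuffled_file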

  • 2021-01-01 21:02

    Using a form of the decorate-sort-undecorate pattern with awk, you can do something like:

    $ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
    8
    5
    1
    9
    6
    3
    7
    2
    10
    4
    

    For a file, you would do:

    $ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT
    

    or cat the file at the start of the pipeline.

    This works by generating a column of random numbers between 000000 and 999999 inclusive (decorate), sorting on that column (sort), and then deleting the column (undecorate). Because the column is zero-padded to a fixed width, it also works on platforms whose sort does not understand numeric keys, since the lexicographic order then matches the numeric order.

    You can increase that randomization, if desired, in several ways:

    1. If your platform's sort understands numerical values (POSIX, GNU and BSD do), you can use awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to get nearly the full precision of a double for the random key.

    2. If you are limited to a lexicographic sort, combine two calls to rand into one column: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort | cut -f 2-, which gives a composite key with 12 digits of randomization.

  • 2021-01-01 21:02

    If the file is within a few orders of magnitude of what can fit in memory, one option is to randomly distribute the lines among (say) 1000 temporary files, then shuffle each of those files and concatenate the result:

    perl -we ' my $NUM_FILES = 1000;
               my @fhs;
               for (my $i = 0; $i < $NUM_FILES; ++$i) {
                 # one temporary output file per bucket
                 open $fhs[$i], ">", "tmp.$i.txt"
                   or die "Error opening tmp.$i.txt: $!";
               }
               while (<>) {
                 # send each input line to a randomly chosen bucket
                 $fhs[int rand $NUM_FILES]->print($_);
               }
               foreach my $fh (@fhs) {
                 close $fh;
               }
             ' < input.txt \
    && \
    for tmp_file in tmp.*.txt ; do
      shuf ./"$tmp_file" && rm ./"$tmp_file"
    done > output.txt
    

    (Of course, there will be some variation in the sizes of the temporary files — they won't all be exactly one-thousandth the size of the original file — so if you use this approach, you need to give yourself some buffer by erring on the side of more, smaller files.)
