问题
I am currently working on building a classifier using c5.0. I have a dataset of 8000 entries and each entry has its own i.d number (1-8000). When testing the performance of the classifier I had to make 5sets of 10:90 (training data: test data) splits. Of course any training cases cannot appear again in the test cases, and duplicates cannot occur in either set.
To solve the problem of picking examples at random for the training data, and making sure the same cannot be picked for the test data I have developed a horribly slow method;
fill a file with numbers from 1-8000 on separate lines.
randomly pick a line number (from a range of 1-8000) and use the contents of the line as the id number of the training example.
write all unpicked numbers to a new file
decrement the range of the random number generator by 1
redo
Then all unpicked numbers are used as test data. It works but its slow. To speed things up I could use List::Util 'shuffle' to just 'randomly' shuffle and array of these numbers. But how random is 'shuffle'? It is essential that the same level of accuracy is maintained. Sorry about the essay, but does anyone know how 'shuffle' actually works. Any help at all would be great
回答1:
Here is the shuffle algorithm used in List::Util::PP
sub shuffle (@) {
my @a=\(@_);
my $n;
my $i=@_;
map {
$n = rand($i--);
(${$a[$n]}, $a[$n] = $a[$i])[0];
} @_;
}
Which looks like a Fisher-Yates shuffle.
来源:https://stackoverflow.com/questions/5168104/how-does-listutil-shuffle-actually-work