问题
I have a simple question regarding the sample function in R. I'm randomly sampling from 0s and 1s and summing them together, from an input vector of length 5, which designates the number of trials to run and sets the seed to generate reproducible random numbers. Seed works as expected, but I get different matrices of random numbers depending on what I put in the prob statement. In this case I assumed prob=NULL should be the same as prob=c(0.5,0.5). Why isn't it?
vn<-c(12, 44, 9, 17, 28)
> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T)), simplify=TRUE)}))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6 7 7 6 6 9 3 6 2 5
[2,] 22 21 20 29 22 24 24 19 25 19
[3,] 4 8 3 5 4 4 4 6 4 2
[4,] 8 4 12 9 11 7 9 10 8 8
[5,] 13 9 11 14 12 14 10 13 11 12
> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6 5 5 6 6 3 9 6 10 7
[2,] 22 23 24 15 22 20 20 25 19 25
[3,] 5 1 6 4 5 5 5 3 5 7
[4,] 9 13 5 8 6 10 8 7 9 9
[5,] 15 19 17 14 16 14 18 15 17 16
UPDATE:
I extended the samplings to 100, with an input vector
vn<-seq(0,100,5)
and compared the rowMeans of the output matrices without prob (test1) and with prob=c(0.5,0.5) against expected mean. Interestingly, test1 and test2 are off by the exact same amount by reversed signs. Why is that? Thanks!
> rowMeans(test1)-seq(0,100,5)/2
[1] 0.00 -0.07 -0.01 -0.35 -0.07 0.19 -0.07 0.24 0.21 0.46 0.20 0.50 -0.37 -0.35 0.00 0.64 -0.59 0.63 -1.19 0.44 -0.38
> rowMeans(test2)-seq(0,100,5)/2
[1] 0.00 0.07 0.01 0.35 0.07 -0.19 0.07 -0.24 -0.21 -0.46 -0.20 -0.50 0.37 0.35 0.00 -0.64 0.59 -0.63 1.19 -0.44 0.38
回答1:
As suggested by Randy, different routines are used by sample.int
depending on whether prop
is NULL.
In your case, it returns inverse results:
> set.seed(1); sample(c(0,1), size=20, replace=TRUE)
[1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1
> set.seed(1); sample(c(0,1), size=20, replace=TRUE, prob=c(.5,.5))
[1] 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0
What's going on?
For the former, we hit line src/main/random.c:546
:
for (int i = 0; i < k; i++) iy[i] = (int)(dn * unif_rand() + 1);
This one is simple. unif_rand()
returns a value between 0 and 1 (and will never return 1), dn
is 2 (the number of elements in x
) so iy[i]
is set to 1
or 2
, depending on whether unif_rand()
returns a value < .5
or >= .5
respectively; this is the value chosen from x
.
The latter is a bit more complex. Because prob
is specified, do_sample
calls the function ProbSampleReplace
at src/main/random.c:309
. Here, the probabilities are sorted into descending order with the function revsort
at src/main/sort.c:248
. This uses a heap sort on the probabilities, and with a two-element vector of equal probabilities, it reverses the order.
ProbSampleReplace
again calls unif_rand()
but this time it maps it to the cumulative probabilities computed after flipping the order of the vector, so if unif_rand()
returns a value < 0.5
the second value is returned (1
in your example). This is the code that does the mapping of unif_rand()
to the values in x
:
/* compute the sample */
for (i = 0; i < nans; i++) {
rU = unif_rand();
for (j = 0; j < nm1; j++) {
if (rU <= p[j])
break;
}
ans[i] = perm[j];
}
So with equal probabilities of two elements, setting the probability explicitly to c(0.5, 0.5)
will return the inverse of the same call without setting the probabilities. With more than two elements, it's not going to always reverse them, but it won't return the same order.
This also explains why Fernando's suggestion works. The values are close enough to .5 as to not change the results for this example, and the heap sort returns the values in the original order.
This expression returns the same matrix as your first line of code:
do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(1,0),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))
Here, the order of the entries in x
have been reversed to account for the two-element sort of equal values (which swaps the entries).
Of course this is all academic. In practice, permuting the order of equiprobable entries doesn't matter.
Source files and line numbers above refer to R 3.0.2
.
回答2:
I updated my comment to an answer. sample
uses different c routines for uniform sampling and weighted sampling. Though you are using equal weights, R will call the weighted sampling anyway.
To see this, consider
> set.seed(1)
> sample.int(100)
[1] 27 37 57 89 20 86 97 62 58 6 19 16 61 34 67 43 88 83
[19] 32 63 75 17 51 10 21 29 1 28 81 25 87 42 70 13 55 44
[37] 78 7 45 26 50 39 46 82 30 65 2 84 59 36 24 85 22 12
[55] 4 5 14 23 73 79 99 47 18 95 60 77 41 53 3 69 11 71
[73] 35 31 40 49 76 9 38 64 80 66 8 91 33 92 100 54 98 94
[91] 52 74 68 72 93 15 56 48 90 96
> set.seed(1)
> sample.int(100, prob = rep(1/100, 100))
[1] 28 39 60 93 21 91 96 67 63 7 22 18 71 41 79 51 74 1
[19] 38 78 94 20 64 12 29 40 2 42 87 35 50 61 52 17 84 69
[37] 81 10 73 44 85 65 80 54 49 82 4 46 75 68 43 90 36 23
[55] 8 11 30 55 66 34 97 26 47 31 70 24 53 86 6 95 32 89
[73] 27 33 56 98 88 25 77 100 37 62 19 15 76 13 59 5 14 9
[91] 45 3 83 99 72 58 48 57 92 16
Note that the two different sampled sequences.
来源:https://stackoverflow.com/questions/23316729/r-sample-probabilities-default-is-equal-weight-why-does-specifying-equal-weigh