If I want to sample numbers to create a vector I do:
set.seed(123)
x <- sample(1:100,200, replace = TRUE)
sum(x)
# [1] 10228
What if I
Another approach, but with floating point numbers, so not exactly what you are looking for, sorry :
randomsum <- function(nb, sum) {
tmp <- sort(runif(nb-1))
tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
as.vector(quantile(0:sum, probs=tmp))
}
Which would give for example :
R> result <- randomsum(10, 1000)
R> result
[1] 35.282191 66.537308 17.263761 182.837409 120.064363 210.752735
[7] 143.201079 6.164731 34.936359 182.960064
R> sum(result)
[1] 1000
You could use round
on the result to get integers, but then of course the sum could be slightly different from what you want to get. A quick and dirty workaround can be to alter one of the random values to make your vector sum to the number you want :
randomsumint <- function(nb, sum) {
tmp <- sort(runif(nb-1))
tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
res <- as.vector(quantile(0:sum, probs=tmp))
res <- round(res)
res[length(res)] <- res[length(res)]+(sum-sum(res))
res
}
Which would give :
R> result <- randomsumint(10,1000)
R> result
[1] 42 152 0 11 74 138 9 138 172 264
R> sum(result)
[1] 1000
Not that this is far from perfect, as in some rare cases you could get a negative value in your result.
Here's another attempt. It doesn't use sample
, but uses runif
. I've added an optional "message" to the output showing the sum, which can be triggered using the showSum
argument. There is also a Tolerance
argument that specifies how close to the target is required.
SampleToSum <- function(Target = 100, VecLen = 10,
InRange = 1:100, Tolerance = 2,
showSum = TRUE) {
Res <- vector()
while ( TRUE ) {
Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
if ( all(Res > 0) &
all(Res >= min(InRange)) &
all(Res <= max(InRange)) &
abs((sum(Res) - Target)) <= Tolerance ) { break }
}
if (isTRUE(showSum)) cat("Total = ", sum(Res), "\n")
Res
}
Here are some examples.
Notice the difference between the default setting and setting Tolerance = 0
set.seed(1)
SampleToSum()
# Total = 101
# [1] 20 6 11 20 6 3 24 1 4 6
SampleToSum(Tolerance=0)
# Total = 100
# [1] 19 15 4 10 1 11 7 16 4 13
You can verify this behavior by using replicate
. Here's the result of setting Tolerance = 0
and running the function 5 times.
system.time(output <- replicate(5, SampleToSum(
Target = 1376,
VecLen = 13,
InRange = 10:200,
Tolerance = 0)))
# Total = 1376
# Total = 1376
# Total = 1376
# Total = 1376
# Total = 1376
# user system elapsed
# 0.144 0.000 0.145
output
# [,1] [,2] [,3] [,4] [,5]
# [1,] 29 46 11 43 171
# [2,] 103 161 113 195 197
# [3,] 145 134 91 131 147
# [4,] 154 173 138 19 17
# [5,] 197 62 173 11 87
# [6,] 101 142 87 173 99
# [7,] 168 61 97 40 121
# [8,] 140 121 99 135 117
# [9,] 46 78 31 200 79
# [10,] 140 168 146 17 56
# [11,] 21 146 117 182 85
# [12,] 63 30 180 179 78
# [13,] 69 54 93 51 122
And the same for setting Tolerance = 5
and running the function 5 times.
system.time(output <- replicate(5, SampleToSum(
Target = 1376,
VecLen = 13,
InRange = 10:200,
Tolerance = 5)))
# Total = 1375
# Total = 1376
# Total = 1374
# Total = 1374
# Total = 1376
# user system elapsed
# 0.060 0.000 0.058
output
# [,1] [,2] [,3] [,4] [,5]
# [1,] 65 190 103 15 47
# [2,] 160 95 98 196 183
# [3,] 178 169 134 15 26
# [4,] 49 53 186 48 41
# [5,] 104 81 161 171 180
# [6,] 54 126 67 130 182
# [7,] 34 131 49 113 76
# [8,] 17 21 107 62 95
# [9,] 151 136 132 195 169
# [10,] 194 187 91 163 22
# [11,] 23 69 54 97 30
# [12,] 190 14 134 43 150
# [13,] 156 104 58 126 175
Not surprisingly, setting the tolerance to 0 would make the function slower.
Note that since this is a "random" process, it's hard to guess how long it would take to find the right combination of numbers. For example, using set.seed(123)
, I ran the following test three times in a row:
system.time(SampleToSum(Target = 1163,
VecLen = 15,
InRange = 50:150))
The first run took just over 9 seconds. The second took just over 7.5 seconds. The third took... just under 381 seconds! That's a lot of variation!
Out of curiosity, I added a counter into the function, and the first run took 55026 attempts to arrive at a vector that satisfied all of our conditions! (I didn't bother trying for the second and third attempts.)
It might be good to add some error or sanity checking into the function to make sure the inputs are reasonable. For example, one should not be able to enter SampleToSum(Target = 100, VecLen = 10, InRange = 15:50)
since with a range of 15 to 50, there's no way to get to 100 AND have 10 values in your vector.
I thought of stars and bars and partitions in combinatorics:
foo <- function(n,total) {
while(!exists("x",inherits=FALSE) || 1 %in% diff(x)) {
x <- sort(c(0,sample.int(n+total,n-1,replace=FALSE),n+total))
}
print(x)
sort(diff(x)-1)
}
Another method is to use the partitions package. This is more suited to enumerating all partitions, but it is okay for now. It works as long as your total number is small.
require(partitions)
foo <- function(n,total) {
x <- restrictedparts(total,n,include.zero=FALSE)
return(x[,sample.int(ncol(x),1)])
}
Assuming that you want integers (if not then look at the Dirichlet distribution) then this can be thought of as a ball and urn problem (without further restrictions on relationship between the numbers).
If you want 20 numbers then that can be represented by 20 urns. You want the numbers to sum to 100 so that is 100 balls. Since you want exactly 20 numbers (skip this step if you want up to 20 numbers, but could be fewer) you start by placing 1 ball in each urn, then randomly distribute the remaining balls between the urns. Count the number of balls in each urn and you will have 20 numbers that sum to 100.
As R code:
as.vector(table( c( 1:20, sample(1:20, 80, replace=TRUE) ) ))
The as.vector
just strips off the table class and labels.
Quick, simple, exact, no loops, recursion, etc.
For other totals or number of values just change the appropriate pieces above.
An attempt using R
# Config
n <- 20L
target <- 100L
vec <- seq(100)
set.seed(123)
# R repeat loop
sumto_repeat <- function(vec,n,target) {
res <- integer()
repeat {
cat("begin:",sum(res),length(res),"\n")
res <- c( res, sample(vec,1) )
if( sum(res)<target & length(res)==(n-1) ) {
res[length(res)+1] <- target - sum(res)
}
# cat("mid:",sum(res),length(res),"\n")
if(sum(res)>target) res <- res[-length(res)]
if( length(res)>n | length(res)<n & sum(res)==target ) {
res <- res[-sample(seq(length(res)),1)]
}
# cat("end:",sum(res),length(res),"\n")
# cat(dput(res),"\n")
if( sum(res)==target & length(res)==n ) break
}
res
}
test <- sumto_repeat(vec=vec,n=n,target=target)
> sum(test)
[1] 100
> length(test)
[1] 20
Also, I'd give some thought to what distribution you'd like to be drawing from. I think that there are a few different ways of getting it to sum to exactly target
with n
elements (for instance, you could make the last element always be target - sum(res)
) that may or may not have different distributional implications.
A very similar algorithm in Rcpp, for speeeeed!
cpp_src <- '
Rcpp::IntegerVector xa = clone(x); // Vector to be sampled
Rcpp::IntegerVector na(n); // Number of elements in solution
Rcpp::IntegerVector sa(s); // Sum of solution
int nsampled;
int currentSum;
int dropRandomIndex;
int numZeroes;
Rcpp::IntegerVector remainingQuantity(1);
int maxAttempts = 100;
// Create container for our results
Rcpp::IntegerVector res(maxAttempts);
std::fill( res.begin(), res.end(), NA_INTEGER );
// Calculate min/max so that we can draw random integers from within range
Rcpp::IntegerVector::iterator mn = std::min_element(xa.begin(), xa.end()) ;
Rcpp::IntegerVector::iterator mx = std::max_element(xa.begin(), xa.end()) ;
std::cout << "mx = " << *mx << std::endl;
// Now draw repeatedly
nsampled = 0;
for( int i = 0; i < maxAttempts; i++ ) {
std::cout << "\\n" << i;
int r = *mn + (rand() % (int)(*mx - *mn + 1));
res[i] = xa[r+1];
// Calculate n and s for current loop iteration
numZeroes = 0;
for( int j = 0; j < maxAttempts; j++)
if(res[j]==0) numZeroes++;
std::cout << " nz= " << numZeroes ;
nsampled = maxAttempts - sum( is_na(res) ) - numZeroes - 1;
currentSum = std::accumulate(res.begin(),res.begin()+i,0); // Cant just use Rcpp sugar sum() here because it freaks at the NAs
std::cout << " nsamp= " << nsampled << " sum= " << currentSum;
if(nsampled == na[0]-1) {
std::cout << " One element away. ";
remainingQuantity[0] = sa[0] - currentSum;
std::cout << "remainingQuantity = " << remainingQuantity[0];
if( (remainingQuantity[0] > 0) && (remainingQuantity[0]) < *mx ) {
std::cout << "Within range. Prepare the secret (cheating) weapon!\\n";
std::cout << sa[0] << " ";
std::cout << currentSum << " ";
std::cout << remainingQuantity[0] << std::endl;
if( i != maxAttempts ) {
std::cout << "Safe to add one last element on the end. Doing so.\\n";
res[i] = remainingQuantity[0];
}
currentSum = sa[0];
nsampled++;
if(nsampled == na[0] && currentSum == sa[0]) std::cout << "It should end after this...nsamp= " << nsampled << " and currentSum= " << currentSum << std::endl;
break;
} else {
std::cout << "Out of striking distance. Dropping random element\\n";
dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
res[dropRandomIndex] = 0;
}
}
if(nsampled == na[0] && currentSum == sa[0]) {
std::cout << "Success!\\n";
for(int l = 0; l <= i+1; l++)
std::cout << res[l] << " " ;
break;
}
if(nsampled == na[0] && currentSum != sa[0]) {
std::cout << "Reached number of elements but sum is ";
if(currentSum > sa[0]) {
std::cout << "Too high. Blitz everything and start over!\\n";
for(int k = 0; k < res.size(); k++) {
res[k] = NA_INTEGER;
}
} else {
std::cout << "Too low. \\n";
}
}
if( nsampled < na[0] && currentSum >= sa[0] ) {
std::cout << "Too few elements but at or above the sum cutoff. Dropping a random element and trying again.\\n";
dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
res[dropRandomIndex] = 0;
}
}
return res;
'
sumto <- cxxfunction( signature(x="integer", n="integer", s="integer"), body=cpp_src, plugin="Rcpp", verbose=TRUE )
testresult <- sumto(x=x, n=20L, s=1000L)
testresult <- testresult[!is.na(testresult)]
testresult <- testresult[testresult!=0]
testresult
cumsum(testresult)
length(testresult)
Tried it with a few different values, and produces valid answers unless it runs away. There's a caveat here, which is that it cheats if it's one away from the desired number of elements and within "striking distance" -- e.g. rather than just drawing the last value it calculates it if that number is valid.
Benchmarks
See gist for comparison code.