Sample with a max

前端 未结 5 1189
日久生厌
日久生厌 2021-01-19 01:57

If I want to sample numbers to create a vector I do:

set.seed(123)
x <- sample(1:100,200, replace = TRUE)
sum(x)
# [1] 10228

What if I

相关标签:
5条回答
  • 2021-01-19 02:31

    Another approach, but with floating point numbers, so not exactly what you are looking for, sorry :

    randomsum <- function(nb, sum) {
      tmp <- sort(runif(nb-1))
      tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
      as.vector(quantile(0:sum, probs=tmp))
    }
    

    Which would give for example :

    R> result <- randomsum(10, 1000)
    R> result
     [1]  35.282191  66.537308  17.263761 182.837409 120.064363 210.752735
     [7] 143.201079   6.164731  34.936359 182.960064
    R> sum(result)
    [1] 1000
    

    You could use round on the result to get integers, but then of course the sum could be slightly different from what you want to get. A quick and dirty workaround can be to alter one of the random values to make your vector sum to the number you want :

    randomsumint <- function(nb, sum) {
      tmp <- sort(runif(nb-1))
      tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
      res <- as.vector(quantile(0:sum, probs=tmp))
      res <- round(res)
      res[length(res)] <- res[length(res)]+(sum-sum(res))
      res
    }
    

    Which would give :

    R> result <- randomsumint(10,1000)
    R> result
     [1]  42 152   0  11  74 138   9 138 172 264
    R> sum(result)
    [1] 1000
    

    Not that this is far from perfect, as in some rare cases you could get a negative value in your result.

    0 讨论(0)
  • 2021-01-19 02:34

    Here's another attempt. It doesn't use sample, but uses runif. I've added an optional "message" to the output showing the sum, which can be triggered using the showSum argument. There is also a Tolerance argument that specifies how close to the target is required.

    SampleToSum <- function(Target = 100, VecLen = 10, 
                            InRange = 1:100, Tolerance = 2, 
                            showSum = TRUE) {
      Res <- vector()
      while ( TRUE ) {
        Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
        if ( all(Res > 0)  & 
             all(Res >= min(InRange)) &
             all(Res <= max(InRange)) &
             abs((sum(Res) - Target)) <= Tolerance ) { break }
      }
      if (isTRUE(showSum)) cat("Total = ", sum(Res), "\n")
      Res
    }
    

    Here are some examples.

    Notice the difference between the default setting and setting Tolerance = 0

    set.seed(1)
    SampleToSum()
    # Total =  101 
    #  [1] 20  6 11 20  6  3 24  1  4  6
    SampleToSum(Tolerance=0)
    # Total =  100 
    #  [1] 19 15  4 10  1 11  7 16  4 13
    

    You can verify this behavior by using replicate. Here's the result of setting Tolerance = 0 and running the function 5 times.

    system.time(output <- replicate(5, SampleToSum(
      Target = 1376,
      VecLen = 13,
      InRange = 10:200,
      Tolerance = 0)))
    # Total =  1376 
    # Total =  1376 
    # Total =  1376 
    # Total =  1376 
    # Total =  1376 
    #    user  system elapsed 
    #   0.144   0.000   0.145
    output
    #       [,1] [,2] [,3] [,4] [,5]
    #  [1,]   29   46   11   43  171
    #  [2,]  103  161  113  195  197
    #  [3,]  145  134   91  131  147
    #  [4,]  154  173  138   19   17
    #  [5,]  197   62  173   11   87
    #  [6,]  101  142   87  173   99
    #  [7,]  168   61   97   40  121
    #  [8,]  140  121   99  135  117
    #  [9,]   46   78   31  200   79
    # [10,]  140  168  146   17   56
    # [11,]   21  146  117  182   85
    # [12,]   63   30  180  179   78
    # [13,]   69   54   93   51  122
    

    And the same for setting Tolerance = 5 and running the function 5 times.

    system.time(output <- replicate(5, SampleToSum(
      Target = 1376,
      VecLen = 13,
      InRange = 10:200,
      Tolerance = 5)))
    # Total =  1375 
    # Total =  1376 
    # Total =  1374 
    # Total =  1374 
    # Total =  1376 
    #    user  system elapsed 
    #   0.060   0.000   0.058 
    output
    #       [,1] [,2] [,3] [,4] [,5]
    #  [1,]   65  190  103   15   47
    #  [2,]  160   95   98  196  183
    #  [3,]  178  169  134   15   26
    #  [4,]   49   53  186   48   41
    #  [5,]  104   81  161  171  180
    #  [6,]   54  126   67  130  182
    #  [7,]   34  131   49  113   76
    #  [8,]   17   21  107   62   95
    #  [9,]  151  136  132  195  169
    # [10,]  194  187   91  163   22
    # [11,]   23   69   54   97   30
    # [12,]  190   14  134   43  150
    # [13,]  156  104   58  126  175
    

    Not surprisingly, setting the tolerance to 0 would make the function slower.


    Speed (Or lack thereof)

    Note that since this is a "random" process, it's hard to guess how long it would take to find the right combination of numbers. For example, using set.seed(123), I ran the following test three times in a row:

    system.time(SampleToSum(Target = 1163,
                            VecLen = 15,
                            InRange = 50:150))
    

    The first run took just over 9 seconds. The second took just over 7.5 seconds. The third took... just under 381 seconds! That's a lot of variation!

    Out of curiosity, I added a counter into the function, and the first run took 55026 attempts to arrive at a vector that satisfied all of our conditions! (I didn't bother trying for the second and third attempts.)

    It might be good to add some error or sanity checking into the function to make sure the inputs are reasonable. For example, one should not be able to enter SampleToSum(Target = 100, VecLen = 10, InRange = 15:50) since with a range of 15 to 50, there's no way to get to 100 AND have 10 values in your vector.

    0 讨论(0)
  • 2021-01-19 02:36

    I thought of stars and bars and partitions in combinatorics:

    foo <- function(n,total) {
      while(!exists("x",inherits=FALSE) || 1 %in% diff(x)) {
        x <- sort(c(0,sample.int(n+total,n-1,replace=FALSE),n+total))
      }
      print(x)
      sort(diff(x)-1)
    }
    

    Another method is to use the partitions package. This is more suited to enumerating all partitions, but it is okay for now. It works as long as your total number is small.

    require(partitions)
    foo <- function(n,total) { 
      x <- restrictedparts(total,n,include.zero=FALSE)
      return(x[,sample.int(ncol(x),1)])
    }
    
    0 讨论(0)
  • 2021-01-19 02:38

    Assuming that you want integers (if not then look at the Dirichlet distribution) then this can be thought of as a ball and urn problem (without further restrictions on relationship between the numbers).

    If you want 20 numbers then that can be represented by 20 urns. You want the numbers to sum to 100 so that is 100 balls. Since you want exactly 20 numbers (skip this step if you want up to 20 numbers, but could be fewer) you start by placing 1 ball in each urn, then randomly distribute the remaining balls between the urns. Count the number of balls in each urn and you will have 20 numbers that sum to 100.

    As R code:

    as.vector(table( c( 1:20, sample(1:20, 80, replace=TRUE) ) ))
    

    The as.vector just strips off the table class and labels.

    Quick, simple, exact, no loops, recursion, etc.

    For other totals or number of values just change the appropriate pieces above.

    0 讨论(0)
  • 2021-01-19 02:52

    An attempt using R

    # Config
    n <- 20L
    target <- 100L
    vec <- seq(100)
    set.seed(123)
    
    # R repeat loop
    sumto_repeat <- function(vec,n,target) {
      res <- integer()
      repeat {
        cat("begin:",sum(res),length(res),"\n")
        res <- c( res, sample(vec,1) )
        if( sum(res)<target & length(res)==(n-1) ) {
          res[length(res)+1] <- target - sum(res)
        }
        # cat("mid:",sum(res),length(res),"\n")
        if(sum(res)>target) res <- res[-length(res)]
        if( length(res)>n | length(res)<n & sum(res)==target ) {
          res <- res[-sample(seq(length(res)),1)]
        }
        # cat("end:",sum(res),length(res),"\n")
        # cat(dput(res),"\n")
        if( sum(res)==target & length(res)==n ) break
      }
      res
    }
    
    test <- sumto_repeat(vec=vec,n=n,target=target)
    > sum(test)
    [1] 100
    > length(test)
    [1] 20
    

    Also, I'd give some thought to what distribution you'd like to be drawing from. I think that there are a few different ways of getting it to sum to exactly target with n elements (for instance, you could make the last element always be target - sum(res)) that may or may not have different distributional implications.

    A very similar algorithm in Rcpp, for speeeeed!

    cpp_src <- '
    Rcpp::IntegerVector xa = clone(x); // Vector to be sampled
    Rcpp::IntegerVector na(n); // Number of elements in solution
    Rcpp::IntegerVector sa(s); // Sum of solution
    
    int nsampled;
    int currentSum;
    int dropRandomIndex;
    int numZeroes;
    Rcpp::IntegerVector remainingQuantity(1);
    int maxAttempts = 100;
    
    // Create container for our results
    Rcpp::IntegerVector res(maxAttempts);
    std::fill( res.begin(), res.end(), NA_INTEGER );
    
    // Calculate min/max so that we can draw random integers from within range
    Rcpp::IntegerVector::iterator mn = std::min_element(xa.begin(), xa.end()) ;
    Rcpp::IntegerVector::iterator mx = std::max_element(xa.begin(), xa.end()) ;
    std::cout << "mx = " << *mx << std::endl;
    
    // Now draw repeatedly
    nsampled = 0;
    for( int i = 0; i < maxAttempts; i++ ) {
      std::cout << "\\n" << i;
      int r = *mn + (rand() % (int)(*mx - *mn + 1));
      res[i] = xa[r+1];
      // Calculate n and s for current loop iteration
      numZeroes = 0;
      for( int j = 0; j < maxAttempts; j++) 
        if(res[j]==0) numZeroes++;
      std::cout << " nz= " << numZeroes ;
      nsampled = maxAttempts - sum( is_na(res) ) - numZeroes - 1;
      currentSum = std::accumulate(res.begin(),res.begin()+i,0); // Cant just use Rcpp sugar sum() here because it freaks at the NAs
      std::cout << " nsamp= " << nsampled << " sum= " << currentSum;
      if(nsampled == na[0]-1) {  
        std::cout << " One element away. ";
        remainingQuantity[0] = sa[0] - currentSum;
        std::cout << "remainingQuantity = " << remainingQuantity[0];
        if( (remainingQuantity[0] > 0) && (remainingQuantity[0]) < *mx ) {
          std::cout << "Within range.  Prepare the secret (cheating) weapon!\\n";
          std::cout << sa[0] << " ";
          std::cout << currentSum << " ";
          std::cout << remainingQuantity[0] << std::endl;
          if( i != maxAttempts ) {
            std::cout << "Safe to add one last element on the end.  Doing so.\\n";
            res[i] = remainingQuantity[0];
          }
          currentSum = sa[0];
          nsampled++;
          if(nsampled == na[0] && currentSum == sa[0]) std::cout << "It should end after this...nsamp= " << nsampled << " and currentSum= " << currentSum << std::endl;
          break;
        } else {
          std::cout << "Out of striking distance.  Dropping random element\\n";
          dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
          res[dropRandomIndex] = 0;
        }
      }
      if(nsampled == na[0] && currentSum == sa[0]) {
          std::cout << "Success!\\n";
          for(int l = 0; l <= i+1; l++) 
            std::cout << res[l] << " " ;
          break;
      }
      if(nsampled == na[0] && currentSum != sa[0]) {
        std::cout << "Reached number of elements but sum is ";
        if(currentSum > sa[0]) {
          std::cout << "Too high. Blitz everything and start over!\\n";
          for(int k = 0; k < res.size(); k++) {
            res[k] = NA_INTEGER;
          }
        } else {
          std::cout << "Too low.  \\n";
    
        }
      }
      if( nsampled < na[0] && currentSum >= sa[0] ) {
        std::cout << "Too few elements but at or above the sum cutoff.  Dropping a random element and trying again.\\n";
        dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
        res[dropRandomIndex] = 0;
      }
    }
    return res;
    '
    
    sumto <- cxxfunction( signature(x="integer", n="integer", s="integer"), body=cpp_src, plugin="Rcpp", verbose=TRUE )
    
    testresult <- sumto(x=x, n=20L, s=1000L)
    testresult <- testresult[!is.na(testresult)]
    testresult <- testresult[testresult!=0]
    testresult
    cumsum(testresult)
    length(testresult)
    

    Tried it with a few different values, and produces valid answers unless it runs away. There's a caveat here, which is that it cheats if it's one away from the desired number of elements and within "striking distance" -- e.g. rather than just drawing the last value it calculates it if that number is valid.

    Benchmarks

    See gist for comparison code.

    benchmarks

    0 讨论(0)
提交回复
热议问题