Randomly associate elements of two vectors given conditions


The naive solution to this problem involves a loop over the n capital values and, for each capital value, a search over the n loss values, so the solution time grows as n^2. Probably not much can be done about the capital loop, but the loss search can be sped up in two ways. First, the upper bound on the losses that need to be searched for each capital can be found, as Alex and Shambho do, by sorting and using findInterval(). Second, within the capital loop the list of candidate losses passed to sample() can be updated incrementally, as I do below, rather than re-created from the entire list. Since the list of candidate losses is always much smaller than n, execution time with this approach grows more nearly linearly with n, which gives significantly shorter run times over this range of n. It also helps to allocate the loss-tracking vector at its full length up front rather than growing it on each iteration of the loop. My function additionally returns the results in the same order in which the capital values were input, which seems proper. microbenchmark reports the times below for ffben and ffwalt on both of Ben's data sets. Note that times are in milliseconds.

Unit: milliseconds

              expr         min         lq      median          uq         max neval
    ffben(cap2, los2)   1549.8289   1556.113   1565.7139   1592.3230   1593.9527     5
   ffwalt(cap2, los2)    205.4834    206.267    206.5975    207.0464    212.9808     5
 ffben(capital, loss) 154235.8823 154855.444 154969.9196 155052.6070 156250.5489     5
ffwalt(capital, loss)   2071.3610   2074.692   2099.4889   2100.1091   2117.4721     5

Since the capital data set is 10x the size of the cap2 data set, it appears that the times for ffben increase as n^2 while the times for ffwalt increase only linearly, both as expected.
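
For reference, a minimal sketch of how such a comparison can be reproduced with microbenchmark, assuming the ffben and ffwalt functions defined in this thread are already in the workspace (only the smaller cap2/los2 data set is constructed here; the larger capital/loss pair comes from the original question and is not redefined):

library(microbenchmark)

# the smaller of Ben's two data sets, as defined elsewhere in this thread
cap2 <- 1:10000
los2 <- pmax(0, 1:10000 - 10)

# times = 5 repetitions, matching the table above (reported there in milliseconds)
microbenchmark(ffben(cap2, los2), ffwalt(cap2, los2), times = 5)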

ffwalt <- function(caps, loss) {
  len_cap <- length(caps)
  loss_srt <- sort(loss)
  caps_ord <- order(caps)
  caps_srt <- caps[caps_ord]
  cap_mx_ls_idx <- findInterval(caps_srt, loss_srt)  # upper loss bound (index) for each value of capital
  loss_picked <- vector("numeric", len_cap)  # allocate the full loss vector to avoid memory allocation inside the capital loop
  samp <- seq_len(cap_mx_ls_idx[1])
  for (i in seq_len(len_cap - 1)) {
    loss_picked[i] <- sample(x = samp, 1, replace = FALSE)
    if (cap_mx_ls_idx[i + 1] > cap_mx_ls_idx[i])
      add_samp <- seq(cap_mx_ls_idx[i] + 1, cap_mx_ls_idx[i + 1], 1)
    else add_samp <- NULL
    samp <- c(samp[samp != loss_picked[i]], add_samp)
  }
  loss_picked[len_cap] <- samp             # last pick is forced; also avoids sample()'s behavior when x has length 1
  results <- data.frame(capital = caps_srt, loss = loss_srt[loss_picked])
  results[caps_ord, ] <- results           # restore original caps order
  return(results)
}
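
A quick usage check on the small vectors from the original question (this snippet is illustrative, not part of the timing above) confirms that every sampled loss fits under its capital and that the rows come back in the input order:

capital <- c(100, 50, 25, 5)   # example data from the question
loss <- c(45, 10, 5, 1)

res <- ffwalt(capital, loss)
res                            # one row per capital, in the original input order
all(res$loss <= res$capital)   # should be TRUE: every loss fits under its capital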

For an easily understandable answer: you can first add a loss column to the capitals data frame and then repeatedly resample the rows that need to be corrected:

capitals<-data.frame(capital=c(100,50,25,5))
loss=c(45,10,5,1)

capitals$loss <- sample(loss,replace=F)
capitals
   capital loss
1     100    5
2      50   10
3      25    1
4       5   45

for(i in 1:nrow(capitals)) {
    while(capitals[i,2]>capitals[i,1]){
        capitals[i,2] <- sample(loss, 1)
    }
}

capitals
  capital loss
1     100    5
2      50   10
3      25    1
4       5    5

(Note that the last row has been corrected)

If sampling without replacement (replace=F) is needed, one can instead resample the entire loss vector until all rows satisfy the criterion:

capitals <- data.frame(capital = c(100, 50, 25, 5))
loss <- c(45, 10, 5, 1)

capitals$loss <- sample(loss, replace = F)
capitals
   capital loss
1     100    5
2      50   10
3      25    1
4       5   45

while (any(capitals$loss > capitals$capital)) {
    capitals$loss <- sample(loss, replace = F)
}

capitals 
  capital loss
1     100   10
2      50   45
3      25    5
4       5    1
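
This rejection approach is simple, but the chance that a single full permutation satisfies every row drops quickly as the vectors grow (the benchmarks further down show it stalling already at a couple of hundred elements). A small sketch, using the capitals and loss objects defined above and a purely illustrative attempt counter, makes the cost visible:

# illustrative only: count how many full permutations the rejection loop draws
attempts <- 0
repeat {
  capitals$loss <- sample(loss, replace = F)
  attempts <- attempts + 1
  if (all(capitals$loss <= capitals$capital)) break
}
attempts   # number of draws needed before every row satisfied loss <= capital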

First off, thank you everyone for your attempts. I've implemented a simple algorithm which is quicker than the answers thus far (and easier to understand, I think).

library(data.table)

ffben <- function(capitals, losses) { # note: the inputs here are vectors, not data.tables
  lossSamples <- numeric()
  capitals <- sort(capitals)
  for (i in 1:(length(capitals) - 1)) {
    lossSamples[i] <- sample(x = losses[losses <= capitals[i]], 1)
    losses <- losses[-which(losses == lossSamples[i])[1]]
  }
  lossSamples[i + 1] <- losses[1]
  return(data.table(capitals = capitals, losses = lossSamples))
}
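
For completeness, a minimal usage check on the question's small example (this check is illustrative and assumes data.table is loaded as above):

capital <- c(100, 50, 25, 5)     # example data from the question
loss <- c(45, 10, 5, 1)

res <- ffben(capital, loss)
res                              # note: rows come back sorted by capital
all(res$losses <= res$capitals)  # should be TRUE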

Benchmark against alexis's solution

cap2 = 1:10000; los2 = pmax(0,1:10000-10)  # 10000 capitals and losses
microbenchmark::microbenchmark(ffalex(cap2, los2), ffben(cap2, los2), times = 5)

Unit: seconds
               expr   min    lq median    uq   max neval
 ffalex(cap2, los2) 3.725 3.775  3.792 3.977 5.606     5
  ffben(cap2, los2) 2.680 2.868  2.890 2.897 3.056     5

However, I recognize that my solution still has much room for improvement, so I won't accept it as the best answer unless it's still the quickest solution in a week or so. In particular, I am hoping someone can develop a data.table-based solution that takes advantage of data.table's built-in binary search.

Unless I've missed something, here is an approach that looks valid:

capital = c(100, 50, 25, 5); loss = c(45, 10, 5, 1)

sc = sort(capital)
sl = sort(loss)
allowed = lapply(findInterval(sc, sl), seq_len)

replicate(10, {  #just to replicate the process
    tmp = seq_along(loss)
    sams = rep(NA, length(loss))
    for(i in seq_along(allowed)) {
        intsec = intersect(allowed[[i]], tmp)
        s = intsec[sample(length(intsec), 1)]
        tmp[s] = NA
        sams[i] = s
    }
    sl[sams]
})
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    1    1    1    5    1    1    1    5    5     1
#[2,]   10   10    5    1   10   10   10    1    1     5
#[3,]   45    5   10   45    5   45   45   10   45    45
#[4,]    5   45   45   10   45    5    5   45   10    10

Each element in each column, above, corresponds to its respective element in "sc" (sorted capital) [5 25 50 100].
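
If the pairs are needed in the original, unsorted order of capital instead, the inverse of the sorting permutation can be applied afterwards; a minimal sketch, using the capital and loss vectors above and the ffalex function given further down:

res <- ffalex(capital, loss)   # rows follow sort(capital), i.e. 5 25 50 100
res[order(order(capital)), ]   # re-ordered to match the original capital vector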

And some benchmarks comparing with rnso's answer:

cap2 = sample(100:500, 10); los2 = sample(50:250, 10)  #10 capitals and losses
microbenchmark::microbenchmark(ffalex(cap2, los2), ffrnso(cap2, los2), times = 5)
#Unit: microseconds
#               expr     min      lq  median      uq     max neval
# ffalex(cap2, los2) 385.589 396.377 399.162 434.309 591.608     5
# ffrnso(cap2, los2)  14.964  21.577  27.492  42.456  80.389     5
cap2 = sample(100:500, 50); los2 = sample(50:250, 50)  #50
microbenchmark::microbenchmark(ffalex(cap2, los2), ffrnso(cap2, los2), times = 5)
#Unit: milliseconds
#               expr       min        lq     median          uq         max neval
# ffalex(cap2, los2)   1.62031   1.64467   1.949522    1.966226    3.508583     5
# ffrnso(cap2, los2) 283.27681 538.50515 971.273262 3348.542296 4279.280326     5
cap2 = sample(100:500, 2e2); los2 = sample(50:250, 2e2)  #200
system.time({ ans1 = ffalex(cap2, los2) })
#   user  system elapsed 
#   0.01    0.02    0.03 
system.time({ ans2 = ffrnso(cap2, los2) })
#Timing stopped at: 77.69 0.14 78.22

And a check that every sampled loss is indeed "<=" its capital:

#head(ans1)
#      sc   
#[1,] 100 83
#[2,] 101 92
#[3,] 103 59
#[4,] 107 52
#[5,] 109 74
#[6,] 110 79
sum(ans1[, 2] > ans1[, 1])
#[1] 0   #none is greater

The two functions:

ffalex = function (capital, loss) 
{
    sc = sort(capital)
    sl = sort(loss)
    allowed = lapply(findInterval(sc, sl), seq_len)
    tmp = seq_along(loss)
    sams = rep(NA, length(loss))
    for (i in seq_along(allowed)) {
        intsec = intersect(allowed[[i]], tmp)
        s = intsec[sample(length(intsec), 1)]
        tmp[s] = NA
        sams[i] = s
    }
    cbind(sc, sl[sams])
}

ffrnso = function (capital, loss) 
{
    while (any(loss > capital)) {
        loss <- sample(loss, replace = F)
    }
    cbind(capital, loss)
}

Try this for small vectors:

capital <- c(100, 50, 25, 5)
loss <- c(45, 10, 5, 1)

posC <- order(capital)   # positions of capital in increasing order
posC

lossN <- NULL

for (i in posC) {
  idx <- which(loss <= capital[i])
  temp <- if (length(idx) == 1) idx else sample(idx, 1)  # guard against sample()'s 1:n behavior when only one loss qualifies
  lossN <- c(lossN, loss[temp])
  loss <- loss[-temp]
}

data.table(capital = capital, loss = lossN[order(posC)])  # order(posC) maps the sampled losses back to the original capital order

EDIT

This one is for large vectors:

set.seed(100)
loss=sort(sample(1:5000,100000,replace = T))
capitals = sort(sample(1:100000,100000,replace=T))    

capU <- unique(capitals)
length(capU)

splitLoss <- split(loss,findInterval(loss,sort(c(0,capU))))
head(splitLoss)
splitCap <- split(capitals,findInterval(capitals,sort(c(0,capU))))
head(splitCap)

lossN <- NULL
temp <- NULL

for(i in 1:length(splitLoss)){  
  temp <- c(temp,splitLoss[[i]])  
  for(j in 1:length(splitCap[[i]])){
    id <- sample(1:length(temp),1)
    lossN <- c(lossN, temp[id])
    temp <-temp[-id]      
  }
}

lossN <- c(lossN, if (length(temp) == 1) temp else sample(temp))  # ifelse() here would keep only the first element; 'if' appends all remaining losses
data.table(capital=capitals,loss=lossN)

This takes about 7 seconds on my machine. The only assumption here is that capitals is sorted in increasing order. If needed, you can use the order function to make this work for unsorted capitals in two more lines, as sketched below.
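
A minimal, self-contained illustration of that order-based fix (the vectors below are made up purely for the illustration):

# illustration of the order() fix on a tiny made-up example
caps_unsorted <- c(30, 5, 20, 10)
ord <- order(caps_unsorted)        # positions that sort the capitals increasingly
caps_sorted <- caps_unsorted[ord]  # this sorted vector is what the code above expects
# suppose the code above produced lossN aligned with caps_sorted:
lossN <- c(1, 8, 15, 25)
lossN[order(ord)]                  # the same losses re-aligned with caps_unsorted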

Hope this helps!!
