Fill contingency table based on total variable

后端 未结 3 811
忘了有多久
忘了有多久 2021-01-23 03:07

I have a list of stores and I have a product (apples). I ran a system of linear equations to get the column \'var\'; this value represents the amount of apples you will

相关标签:
3条回答
  • 2021-01-23 03:48

    The accepted answer works nicely but I thought I would add one that approaches the problem as a linear programming problem. This would be helpful if

    1. You need to scale the problem to a large number of stores or
    2. You end up deciding that there are genuine cost differences between shipping from store a to store f versus store a to store b and you want a minimal cost solution

    The structure of your problem is a linear programming problem known as the transport problem. Yours is a tidy case where: 1. it costs the same to move a good from any sender to any receiver and 2. your system is balanced insofar as demand = supply.

    The easiest way to think about the constraints on the solution to the problem is (I think) in terms of the matrix of places that are sending goods versus those receiving them. We can make that matrix from your toy example:

    # Load the data
    df <- data.frame(store = c('a', 'b', 'c', 'd', 'e', 'f'), 
                     sku = c('apple', 'apple', 'apple', 'apple', 'apple', 'apple'), 
                     var = c(1,4,-6,-1,5,-3))
    df
    #>   store   sku var
    #> 1     a apple   1
    #> 2     b apple   4
    #> 3     c apple  -6
    #> 4     d apple  -1
    #> 5     e apple   5
    #> 6     f apple  -3
    
    # Seeing the row-column constraints
    sol.mat <- matrix(c(1,4,1,0,0,1,0,0,3), nrow = 3, byrow = TRUE)
    rownames(sol.mat) <- -1 * df$var[df$var < 0]
    colnames(sol.mat) <- df$var[df$var >= 0]
    sol.mat
    #>   1 4 5
    #> 6 1 4 1
    #> 1 0 0 1
    #> 3 0 0 3
    

    What this matrix shows us is that the solution to the system you have proposed satisfies the constraints that all of the row sums equal the amount to be sent from each store and all the column sums equal the amount to be received. Any solution needs to meet those criteria. Thus if we have S senders (the rows) and R receivers (the columns) we have SxR unknowns. If we call each unknown x_ij where i indexes the sender and j the receiver we have the constraints that (A) sum_j x_ij = S_i and (B) sum_i x_ij = R_j. In a normal transport problem we would also have a cost associated with each link between a sender and receiver. This will be an SxR matrix, that we can call C. We would then seek the solution that minimizes costs, and solve numerically with min sum_i sum_j x_ij * c_ij, subject to (A) and (B).

    The fact that costs don't figure in your discussion just means all routes cost the same. We can still use this same structure of the problem to solve for a solution using the existing libraries R has for linear programming. I'm going to use the package lpSolve which has a function for solving precisely this kind of problem called lp.transport. Below I write a wrapper function around lp.transport that takes your known values and the store names and determines a valid solution. The function can also take a user supplied cost matrix (SxR), and can return output either in the compact form of a SxR matrix or as the larger matrix you are hunting for:

    get_transport_matrix <- function(vals, labels, costs = NULL, bigmat = TRUE) {
      if (sum(vals) != 0) {stop("Demand and Supply are Imbalanced!")}
      S <- -1 * vals[which(vals < 0)]
      names(S) <- labels[which(vals < 0)]
      R <- vals[which(vals >=0)]
      names(R) <- labels[which(vals >=0)]
    
      if (is.null(costs)) {
        costs.mat <- matrix(1, length(S), length(R))
      } else {
        costs.mat <- costs
      }
    
      solution <- lpSolve::lp.transport(costs.mat, direction = 'min',
                               row.signs = rep("=", length(S)),
                               row.rhs = S,
                               col.signs = rep("=", length(R)),
                               col.rhs = R)$solution
    
      rownames(solution) <- names(S)
      colnames(solution) <- names(R)
    
      if (!bigmat) {
        return(solution)
      } else {
        bigres <- matrix(0, length(vals), length(vals), 
                         dimnames = list(labels, labels))
        bigres[names(S), names(R)] <- solution
        colnames(bigres) <- paste0("ship_to_", colnames(bigres))
        return(bigres)
      }
    }
    

    We can demo the function with your toy data to see how it works. Here I return just the small sender-receiver matrix. As we can see the solution is different to the one you supplied but also valid.

    get_transport_matrix(df$var, df$store, bigmat = FALSE)
    #>   a b e
    #> c 0 1 5
    #> d 0 1 0
    #> f 1 2 0
    

    Using a linear programming package scales easily. Here for instance we solve for 10 stores:

    get_transport_matrix(c(-10:-1, 10:1), 
                         c(letters[1:10], letters[1:10]),
                         bigmat = FALSE)[1:6,]
    #>   a b c d e f g h i j
    #> a 0 0 0 0 0 0 4 3 2 1
    #> b 0 0 0 0 4 5 0 0 0 0
    #> c 0 0 0 6 2 0 0 0 0 0
    #> d 0 0 6 1 0 0 0 0 0 0
    #> e 0 4 2 0 0 0 0 0 0 0
    #> f 0 5 0 0 0 0 0 0 0 0
    

    Finally, the default output of the function is in a large-matrix format and you can simply cbind() it to your dataframe to obtain your desired output:

    cbind(df, get_transport_matrix(df$var, df$store))
    #>   store   sku var ship_to_a ship_to_b ship_to_c ship_to_d ship_to_e
    #> a     a apple   1         0         0         0         0         0
    #> b     b apple   4         0         0         0         0         0
    #> c     c apple  -6         0         1         0         0         5
    #> d     d apple  -1         0         1         0         0         0
    #> e     e apple   5         0         0         0         0         0
    #> f     f apple  -3         1         2         0         0         0
    #>   ship_to_f
    #> a         0
    #> b         0
    #> c         0
    #> d         0
    #> e         0
    #> f         0
    

    Created on 2019-03-21 by the reprex package (v0.2.1)

    0 讨论(0)
  • 2021-01-23 03:52

    Here's a tidyverse solution. It relies on there being a net zero of each sku.

    If that's the case, then we should be able to line up all the donated items (one row for each unit in the negative vars, sorted by sku) with all the received items (one row for each positive var, sorted by sku). Consequently, the first 5 donated apples are matched with the first 5 received apples, and so on.

    Then we total up the total for each sku between each donor and recipient pair and spread so each recipient gets a column.

    Edit: corrected sign and added complete to match OP solution

    library(tidyverse)
    output <- bind_cols(
    
      # Donors, for whom var is negative
      df %>% filter(var < 0) %>% uncount(-var) %>% select(-var) %>%
        arrange(sku) %>% rename(donor = store),
    
      # Recipients, for whom var is positive
      df %>% filter(var > 0) %>% uncount(var) %>% 
        arrange(sku) %>% rename(recipient = store)) %>%
    
      # Summarize and spread by column
      count(donor, recipient, sku) %>%
      complete(donor, recipient, sku, fill = list(n = 0)) %>%
      mutate(recipient = paste0("ship_to_", recipient)) %>%
      spread(recipient, n, fill = 0)
    
    
    > output
    # A tibble: 6 x 8
      donor sku   ship_to_a ship_to_b ship_to_c ship_to_d ship_to_e ship_to_f
      <fct> <fct>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
    1 a     apple         0         0         0         0         0         0
    2 b     apple         0         0         0         0         0         0
    3 c     apple         1         4         0         0         1         0
    4 d     apple         0         0         0         0         1         0
    5 e     apple         0         0         0         0         0         0
    6 f     apple         0         0         0         0         3         0
    
    0 讨论(0)
  • 2021-01-23 03:54

    I bet there are simpler ways of doing this but this one works.
    The function fun outputs a result identical to the expected one.

    fun <- function(DF){
      n <- nrow(DF)
      mat <- matrix(0, nrow = n, ncol = n)
      VAR <- DF[["var"]]
      neg <- which(DF[["var"]] < 0)
      for(k in neg){
        S <- 0
        Tot <- abs(DF[k, "var"])
        for(i in seq_along(VAR)){
          if(i != k){
            if(VAR[i] > 0){
              if(S + VAR[i] <= Tot){
                mat[k, i] <- VAR[i]
                S <- S + VAR[i]
                VAR[i] <- 0
              }else{
                mat[k, i] <- Tot - S
                S <- Tot
                VAR[i] <- VAR[i] - Tot + S
              }
            }
          }
        }
      }
      colnames(mat) <- paste0("ship_to_", DF[["store"]])
      cbind(DF, mat)
    }
    
    out <- fun(df)
    identical(output, out)
    #[1] TRUE
    
    0 讨论(0)
提交回复
热议问题