R supplying arguments while using case_when (R vectorization)

Question

This is a follow-up to a question I asked before (R apply multiple functions when large number of categories/types are present using case_when (R vectorization)). Unfortunately, I have not been able to solve the problem. I think I have narrowed down its source and wanted to check whether someone with a better understanding could help me find a solution.

Suppose I have the following dataset:

set.seed(100)
City=c("City1","City2","City2","City1")
Business=c("B","A","A","B")
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)

Suppose there are two different businesses, "A" and "B", and two different cities, City1 and City2. My original dataset contains about 200K observations with multiple businesses and about 100 cities. For each city, I have a unique pre-written function to compute adjusted revenue. Instead of running them observation by observation (row by row), I want to use case_when to run the function for the relevant city (e.g. take the observations for City1, run a vectorized function for City1 if possible, then move on to City2, and so on).

For the purposes of illustration, suppose I have the following highly simplified functions for the two cities.

#Writing the custom functions for the categories here
City1=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
  return(BusinessMax)
}

City2=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)-1000*rnorm(1)
  return(BusinessMax)
}

These simple functions subset the data for the city and add (City1) or subtract (City2) a random number from the maximum expected revenue in that city. Once again, these functions are only for illustration and do not reflect the actual functions. I also manually checked that the functions work by typing in:

City1(full_data = zz,observation = zz[1,])
City1(full_data = zz,observation = zz[4,])

and got 29.97808 and 36.31531. Note that since the functions add or subtract a random number, I would expect to get different values for two observations in the same city, as I have obtained here.

Finally, I try to use case_when to run the code as follows:

library(dplyr) #I use dplyr here
zz[,"AdjustedRevenue"] = case_when(
  zz[["City"]]=="City1"~City1(full_data=zz,observation=zz[,]),
  zz[["City"]]=="City2"~City2(full_data=zz,observation=zz[,])
)

The output I receive is the following:

   City Business ExpectedRevenue AdjustedRevenue
1 City1        B              35        43.86785
2 City2        A              20       -81.97127
3 City2        A              15       -81.97127
4 City1        B              19        43.86785

Here, observations 1 and 4 share one adjusted value, and observations 2 and 3 share another. What I would expect instead is a different value for each observation (since I add or subtract a random number for each observation, or at least intended to). Following Martin Gal's answer to my previous question (https://stackoverflow.com/a/62378991/3988575), I suspect this is because I am not passing the second argument of my City1 and City2 functions correctly in the final step. However, I have been somewhat lost trying to figure out why, and what to do to fix it.

It would be really helpful if someone could point out why this is happening and how to fix it. Thanks in advance!

P.S. I am also open to other vectorized solutions. I am relatively new to vectorization and would appreciate any suggestions.

Answer 1:

I converted the City functions to dplyr. If CityMaster is too simplified for the real functions, the computation of mer can be moved inside the individual case_when branches as needed. If a new city is added to the data, it will return NA until a case is defined for it.

library(dplyr)
CityMaster <- function(data, city) {
  mer <- data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max()
  case_when(city == 'City1' ~ mer + 10 * rnorm(1),
            city == 'City2' ~ mer - 1000 * rnorm(1),
            TRUE ~ NA_real_)
}

set.seed(100)
zz %>%
  rowwise() %>%
  mutate(AdjustedRevenue = CityMaster(., City))

# A tibble: 4 x 4
# Rowwise: 
  City  Business ExpectedRevenue AdjustedRevenue
  <chr> <chr>              <dbl>           <dbl>
1 City1 B                     35            30.0
2 City2 A                     20          -867. 
3 City2 A                     15          -299. 
4 City1 B                     19            29.2
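
As for why the original case_when call repeated one value per city, the following base R sketch retraces what City1 computes when it receives observation = zz[,]. It assumes the zz data frame from the question; the names same_city, full_max and value are introduced here only for illustration.

```r
# Data frame from the question
zz <- data.frame(
  City = c("City1", "City2", "City2", "City1"),
  Business = c("B", "A", "A", "B"),
  ExpectedRevenue = c(35, 20, 15, 19)
)

# Inside City1(), observation is the whole data frame, so the City column
# is compared with itself element by element: every row matches.
same_city <- zz$City == zz[, ]$City
print(same_city)

# The "city subset" is therefore the full data, and the maximum is taken
# over all cities rather than over City1 only.
full_max <- max(zz$ExpectedRevenue[which(same_city)])
print(full_max)

# rnorm(1) draws a single number, so the function returns one value.
set.seed(100)
value <- full_max + 10 * rnorm(1)
print(length(value))
```

Because each city function returns a single number, case_when recycles it to every row that matches the city, which is why rows 1 and 4 (and rows 2 and 3) end up identical. It also means City2 starts from the overall maximum of 35 rather than its own maximum of 20. With rowwise(), as used above, the function is called once per row with a single City value, so each row gets its own draw.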

Breaking City functions apart

City1 <- function(data, city) {
  data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max() + 10 * rnorm(1)
}

City2 <- function(data, city) {
  data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max() - 1000 * rnorm(1)
}

set.seed(100)
zz %>%
  rowwise() %>%
  mutate(AdjustedRevenue = case_when(City == 'City1' ~ City1(., City),
                                     City == 'City2' ~ City2(., City),
                                     TRUE ~ NA_real_))
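
Since the question also asks for other vectorized options, here is a minimal base R sketch that calls each city's function once per city instead of once per row. It assumes the city functions can be rewritten to take the revenue vector of one city and return one value per row; city_funs and adjust_by_city are hypothetical names, not part of the original code. One motivation: case_when evaluates every right-hand side, so in the rowwise versions above each city's function runs for every row, which may become costly with about 100 cities and 200K rows.

```r
# Data frame from the question
zz <- data.frame(
  City = c("City1", "City2", "City2", "City1"),
  Business = c("B", "A", "A", "B"),
  ExpectedRevenue = c(35, 20, 15, 19),
  stringsAsFactors = FALSE
)

# Hypothetical vectorized versions of the city functions: each receives the
# revenues of one city and returns one adjusted value per row of that city.
city_funs <- list(
  City1 = function(revenue) max(revenue) + 10 * rnorm(length(revenue)),
  City2 = function(revenue) max(revenue) - 1000 * rnorm(length(revenue))
)

# Loop over the distinct cities (about 100), not over the rows (about 200K).
adjust_by_city <- function(data, funs) {
  adjusted <- rep(NA_real_, nrow(data))
  for (city in unique(as.character(data$City))) {
    f <- funs[[city]]
    if (is.null(f)) next  # no function defined for this city: leave NA
    rows <- data$City == city
    adjusted[rows] <- f(data$ExpectedRevenue[rows])
  }
  adjusted
}

set.seed(100)
zz$AdjustedRevenue <- adjust_by_city(zz, city_funs)
print(zz)
```

Cities without an entry in city_funs stay NA, mirroring the TRUE ~ NA_real_ branch above. The random draws are consumed in a different order than in the rowwise versions, so the numbers will not match those outputs even with the same seed.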

Source: https://stackoverflow.com/questions/62435406/r-supplying-arguments-while-using-case-when-r-vectorization

Tags

vectorization

case-when