Simple Way to Combine Predictions from Multiple Models for Subset Data in R

Question

I would like to build separate models for the different segments of my data. I have built the models like so:

log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)

If I run predictions on the training data, R remembers the subsets, and each prediction vector has the length of its subset.

However, if I run predictions on the test data, each prediction vector has the length of the whole dataset, not that of the subset.

My question is whether there is a simpler way to get the same result than first subsetting the test data, predicting on each subset, concatenating the predictions, rbinding the subsets, and appending the concatenated predictions, like this:

T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- allpred  # plain numeric column; as.data.frame() would nest a data frame inside TAll

I suspect I am missing something obvious and there is an easier way to do this: fit many models on small subsets of the data, then combine them to get predictions for the full test data. How can I do that?

Answer 1:

First, here's some sample data:

set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=TRUE),
  x2=rpois(100,10),
  y=sample(0:1, 100, replace=TRUE))
test <- data.frame(x1=sample(0:1, 10, replace=TRUE),
  x2=rpois(10,10))

Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from the model since it is constant within each subset:

fits <- list(
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1==0),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1==1 & x2<10),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1==1 & x2>=10)
)

Now, for the test data, I create an indicator that specifies which group each observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.

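# Each column flags the test rows that satisfy one model's subset= condition;
# multiplying by the model numbers turns the flags into one model index per row.
whichsubset <- as.vector(sapply(fits, function(x) {
    subsetparam <- x$call$subset
    eval(subsetparam, test)
}) %*% matrix(seq_along(fits), ncol=1))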

You'll want to make sure your groups are mutually exclusive, because this code does not check. Then you can use the indicator as a grouping factor with a split/unsplit strategy to make your predictions:

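# Map pairs models and test groups by position, so every group must occur in test.
# predict() returns the link scale here; pass type = "response" for probabilities.
unsplit(
    Map(function(a, b) predict(a, b),
        fits, split(test, whichsubset)
    ),
    whichsubset
)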
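
The exclusivity caveat above can be checked before predicting. The following is only a sketch under the same toy data and models: the names member, hits, and pred are hypothetical and not part of the original answer, and it fills the predictions by index rather than with split/unsplit, so that a group missing from the test data does not misalign the models.

```r
# Same toy data and models as above, repeated so the sketch is self-contained.
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))

fits <- list(
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 0),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 < 10),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 >= 10)
)

# One logical column per model: does the test row satisfy that model's subset=?
# (member is a hypothetical name; cbind keeps it a matrix even for one test row.)
member <- do.call(cbind, lapply(fits, function(fit) eval(fit$call$subset, test)))

# Every test row should match exactly one condition; stop early otherwise.
hits <- rowSums(member)
stopifnot(all(hits == 1))

# Fill predictions model by model, skipping models with no matching test rows.
pred <- rep(NA_real_, nrow(test))
for (i in seq_along(fits)) {
  rows <- member[, i]
  if (any(rows)) {
    pred[rows] <- predict(fits[[i]], newdata = test[rows, , drop = FALSE],
                          type = "response")
  }
}
test$pred <- pred
```

Because the predictions are written back by row position, test keeps its original row order and no rbind step is needed.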

An even easier strategy would have been to create the segregating factor in the first place. This would make the model fitting easier as well.
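
As a rough illustration of that last suggestion, the sketch below builds the segment factor up front and fits one model per level. The helper segment_of, the column names seg and pred, and the level labels are all hypothetical choices for this example, not something from the original answer.

```r
# Same toy data as above, repeated so the sketch is self-contained.
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))

# Hypothetical helper: assign each row to exactly one segment.
segment_of <- function(d) {
  labels <- c("x1=0", "x1=1,x2<10", "x1=1,x2>=10")
  seg <- ifelse(d$x1 == 0, labels[1],
                ifelse(d$x2 < 10, labels[2], labels[3]))
  factor(seg, levels = labels)
}

train$seg <- segment_of(train)
test$seg  <- segment_of(test)

# One model per segment, named by factor level. The formula is written out as
# y ~ x2 because "." would also pull in seg, which has a single level per subset.
fits <- lapply(split(train, train$seg), function(d) {
  glm(y ~ x2, family = "binomial", data = d)
})

# Predict segment by segment and write results back by row position.
test$pred <- NA_real_
for (s in names(fits)) {
  rows <- test$seg == s
  if (any(rows)) {
    test$pred[rows] <- predict(fits[[s]], newdata = test[rows, , drop = FALSE],
                               type = "response")
  }
}
```

With the factor in place, the same segment_of call labels both training and test rows, so the subset conditions are written only once and are mutually exclusive by construction.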

Source: https://stackoverflow.com/questions/31597258/simple-way-to-combine-predictions-from-multiple-models-for-subset-data-in-r

Tags

predict

multiple-models