Question
I would like to build separate models for the different segments of my data. I have built the models like so:
log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)
If I run the predictions on the training data, R remembers the subsets, and each prediction vector has the length of its respective subset.
However, if I run the predictions on the testing data, each prediction vector has the length of the whole dataset, not that of the subset.
My question is whether there is a simpler way than what I do now: subset the testing data, run the predictions on each subset, concatenate the predictions, rbind the subsets, and append the concatenated predictions, like this:
T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- allpred
I'd like to think I am being stupid and there is an easier way to accomplish this: fitting many models on small subsets of the data. How can I combine them to get predictions on the full testing data?
Answer 1:
First, here's some sample data:
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))
Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from the model since it is constant within each subset:
fits<-list(
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==0),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2<10),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2>=10)
)
Now, for the test data, I create an indicator that specifies which group each observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.
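whichsubset <- as.vector(sapply(fits, function(x) {
  # the unevaluated subset= expression stored in the model call
  subsetparam <- x$call$subset
  # logical vector: which test rows satisfy this model's condition
  eval(subsetparam, test)
}) %*% matrix(seq_along(fits), ncol = 1))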
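You'll want to make sure your groups are mutually exclusive because this code does not check. Then you can use this indicator as a grouping factor with a split/unsplit strategy for making your predictions. Note that predict() returns values on the link scale by default; pass type = "response" to get probabilities as in the question.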
unsplit(
Map(function(a,b) predict(a,b),
fits, split(test, whichsubset)
),
whichsubset
)
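The mutual-exclusivity caveat above can be checked explicitly. What follows is only a hedged sketch, not part of the original answer: it repeats the sample data and models, verifies that every test row matches exactly one subset condition, and predicts on the response scale. The names `membership`, `groups`, and `preds` are introduced here for illustration. Looking up the model by group name is an added assumption, so that the code still works if some group has no rows in `test`.

```r
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))

fits <- list(
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 0),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 < 10),
  glm(y ~ . - x1, family = "binomial", data = train, subset = x1 == 1 & x2 >= 10)
)

# One logical column per model: does the test row satisfy that model's subset condition?
membership <- sapply(fits, function(fit) eval(fit$call$subset, test))

# Every row must belong to exactly one group, otherwise the index below is meaningless
stopifnot(all(rowSums(membership) == 1))

whichsubset <- as.vector(membership %*% seq_along(fits))

# Split the test rows by group; names(groups) holds the group indices present in test
groups <- split(test, whichsubset)
preds <- lapply(names(groups), function(g) {
  predict(fits[[as.integer(g)]], newdata = groups[[g]], type = "response")
})

# Put the predictions back in the original row order
test$pred <- unname(unsplit(preds, whichsubset))
```

After this runs, `test$pred` holds one probability per row of `test`, in the original row order, so no rbind of the subsets is needed.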
An even easier strategy would have been to create the segregating factor in the first place. This would make the model fitting easier as well.
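A minimal sketch of that idea, under stated assumptions, might look like the following. The helper `segment_of`, the column `seg`, and the level names "a", "b", "c" are hypothetical names chosen for illustration and do not come from the original answer. The formula is written as `y ~ x2` on purpose, because `y ~ .` would also pull in the new `seg` column.

```r
set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10),
                    y  = sample(0:1, 100, replace = TRUE))
test <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                   x2 = rpois(10, 10))

# Hypothetical helper: assigns each row to one of three mutually exclusive segments
segment_of <- function(d) {
  factor(ifelse(d$x1 == 0, "a", ifelse(d$x2 < 10, "b", "c")),
         levels = c("a", "b", "c"))
}
train$seg <- segment_of(train)
test$seg  <- segment_of(test)

# One model per segment; the list is named by segment level
fits <- lapply(split(train, train$seg), function(d) {
  glm(y ~ x2, family = "binomial", data = d)
})

# Fill in predictions segment by segment, skipping segments absent from test
test$pred <- NA_real_
for (s in levels(test$seg)) {
  idx <- test$seg == s
  if (any(idx)) {
    test$pred[idx] <- predict(fits[[s]], newdata = test[idx, ], type = "response")
  }
}
```

Because the same function labels both data frames, the segments are mutually exclusive by construction and the training and test definitions cannot drift apart.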
Source: https://stackoverflow.com/questions/31597258/simple-way-to-combine-predictions-from-multiple-models-for-subset-data-in-r