Question
I'm taking part in the Coursera Practical Machine Learning course, and the coursework requires building predictive models using this dataset. After splitting the data into training and testing datasets, based on the outcome of interest (labelled y here, but in fact the classe variable in the dataset):
library(caret)  # provides createDataPartition() and confusionMatrix()
inTrain <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
I have tried 2 different methods:
modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)
vs.
modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)
I assumed they would give identical or very similar results, since the first method loads the 'rpart' package (suggesting to me that it uses this package for the method). However, the timings (caret is much slower) and the results are very different:
Method 1 (caret):
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1264  374  403  357  118
         B   25  324   28  146  124
         C  105  251  424  301  241
         D    0    0    0    0    0
         E    1    0    0    0  418
Method 2 (rpart):
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1288  176   14   79   25
         B   36  569   79   32   68
         C   31   88  690  121  113
         D   14   66   52  523   44
         E   26   50   20   49  651
As you can see, the second approach is a better classifier - the first method is very poor for classes D & E.
I realise this may not be the most appropriate place to ask this question, but I would really appreciate a greater understanding of this and related issues. caret seems like a great package to unify the methods and call syntax, but I'm now hesitant to use it.
Answer 1:
caret actually does quite a bit more under the hood. In particular, it uses resampling to tune the model hyperparameters. In your case, it tries three values of cp (type modFit and you'll see accuracy results for each value) and refits the final tree with whichever value scored best, which need not be 0.01. rpart, by contrast, just uses 0.01 unless you tell it otherwise (see ?rpart.control). The resampling also takes longer, especially since caret uses bootstrapping (25 resamples) by default.
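A minimal sketch of how to inspect what caret tried. It uses the built-in iris data as a stand-in for the course dataset (an assumption, since the original data is not reproduced here) and an arbitrary seed. The results, bestTune and finalModel components are part of the object returned by train:

```r
library(caret)   # the rpart package must also be installed

set.seed(42)     # arbitrary seed so the bootstrap resamples are repeatable

# iris stands in for the course data; Species plays the role of y
modFit <- train(Species ~ ., method = "rpart", data = iris)

# One row per candidate cp value, with resampled Accuracy and Kappa
print(modFit$results)

# The cp value that train() selected for the final tree
print(modFit$bestTune)

# The underlying rpart object, refitted with the selected cp
print(modFit$finalModel)
```

If the selected cp is much larger than 0.01, the final tree has fewer splits than the one plain rpart grows by default, which is one way a whole class can end up never being predicted.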
To get similar results, you need to disable the resampling and specify cp yourself:
modFit <- caret::train(y ~ ., method = "rpart", data = training,
                       trControl = trainControl(method = "none"),
                       tuneGrid = data.frame(cp = 0.01))
In addition, you should use the same random seed for both models.
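A sketch of a like-for-like comparison under the settings above, again using iris as a stand-in dataset (an assumption) and an arbitrary seed value. The agreement rate at the end is just one simple way to check how closely the two fits match:

```r
library(caret)
library(rpart)

set.seed(123)  # arbitrary; fixes the train/test split
inTrain  <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# caret with resampling switched off and cp pinned to rpart's default
set.seed(123)
fitCaret <- train(Species ~ ., method = "rpart", data = training,
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(cp = 0.01))

# plain rpart with its default control settings (cp = 0.01)
set.seed(123)
fitRpart <- rpart(Species ~ ., data = training)

predCaret <- predict(fitCaret, newdata = testing)
predRpart <- predict(fitRpart, newdata = testing, type = "class")

# Share of test rows on which the two models predict the same class
agreement <- mean(as.character(predCaret) == as.character(predRpart))
print(agreement)
```

If the agreement on your own data is below 1, differences in how each interface prepares the predictors are a plausible place to look; for example, the formula method of train expands factor predictors into dummy variables before fitting.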
That said, the extra functionality that caret provides is a Good Thing, and you should probably just go with caret. If you want to learn more, it's well-documented, and the author has a stellar book, Applied Predictive Modeling.
Source: https://stackoverflow.com/questions/29167265/why-do-results-using-carettrain-method-rpart-differ-from-rpartrpar