When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only use 'transform'?
Any transformation you apply to the data must use parameters computed from the training data only.

Simply put, what the `fit()` method does is extract the various parameters (for example, the mean and standard deviation) from your training samples so they can be used for the necessary transformation later on. `transform()`, on the other hand, performs the actual transformation of the data itself, returning a standardized or scaled form. `fit_transform()` is just a shorthand for calling `fit()` and then `transform()` in sequence.
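As a minimal sketch of these three methods using scikit-learn's `StandardScaler` (the toy numbers are made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: one feature, four samples
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                    # learns mean_ and scale_ from the data
print(scaler.mean_, scaler.scale_)     # the parameters extracted by fit()

X_scaled = scaler.transform(X_train)   # applies (x - mean_) / scale_

# fit_transform() performs the same two steps in one call
X_scaled2 = StandardScaler().fit_transform(X_train)
print(np.allclose(X_scaled, X_scaled2))  # True: identical result
```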
The important thing here is that when you divide your dataset into train and test sets, what you are trying to achieve is to simulate a real-world application. In a real-world scenario you will only have training data: you develop a model from it and then predict unseen instances of similar data.
If you transform the entire dataset with `fit_transform()` and then split into train and test, you violate that simulation and perform the transformation according to the unseen examples as well. This will inevitably result in an overly optimistic model, because you have already partly prepared it using statistics computed from the unseen samples.
If you split the data into train and test and apply `fit_transform()` to each split separately, you will also be mistaken: the transformation of the train data will use the train split's statistics only, and the transformation of the test data will use the test split's statistics only, so the two splits are no longer mapped onto the same scale.
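A small sketch of why fitting on each split separately goes wrong (again with `StandardScaler` and made-up numbers where the two splits have very different ranges):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical splits on very different scales
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[100.0], [200.0]])

# Wrong: fit_transform() on each split uses different parameters,
# so raw values on completely different scales end up looking identical
bad_train = StandardScaler().fit_transform(X_train)
bad_test = StandardScaler().fit_transform(X_test)
print(bad_train.ravel(), bad_test.ravel())  # both become [-1.  1.]

# Right: fit on train only, then transform the test data
scaler = StandardScaler().fit(X_train)
good_test = scaler.transform(X_test)
print(good_test.ravel())  # the test values stay far from the train range
```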
The right way to do this preprocessing is to fit any transformer on the training data only and then apply its transformation to the test data. Only then can you be sure that your resulting model represents a real-world solution.
Following this, it actually doesn't matter whether you

`fit(train)` then `transform(train)` then `transform(test)`

OR

`fit_transform(train)` then `transform(test)`

The two sequences give identical results.
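The equivalence of the two sequences can be checked directly (toy arrays are, again, just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

# Option A: fit, then transform train, then transform test
a = StandardScaler()
a.fit(X_train)
a_train, a_test = a.transform(X_train), a.transform(X_test)

# Option B: fit_transform on train, then transform test
b = StandardScaler()
b_train = b.fit_transform(X_train)
b_test = b.transform(X_test)

print(np.allclose(a_train, b_train), np.allclose(a_test, b_test))  # True True
```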