问题
I am getting very high RMSE and MAE for MLPRegressor , ForestRegression and Linear regression with only input variables scaled (30,000+) however when i scale target values aswell i get RMSE (0.2) , i will like to know if that is acceptable thing to do.
Secondly is it normal to have better R squared values for Test (ie. 0.98 and 0.85 for train)
Thank You
回答1:
It is actually a common practice to scale target values in many cases.
For example a highly skewed target may give better results if it is applied log
or log1p
transforms. I don't know the characteristics of your data, but there could a transformation that might decrease your RMSE.
Secondly, Test set is meant to be a sample of unseen data, to give a final estimate of your model's performance. When you see the unseen data and tune to perform better on it, it becomes a cross validation set.
You should try to split your data into three parts, Train, Cross-validation and test sets. Train on your data and tune parameters according to it's performance on cross validation and then after you are done tuning, run it on the test set to get a prediction of how it works on unseen data and mark it as the accuracy of your model.
回答2:
Answering your first question, I think you are quite deceived by the performance measures which you have chosen to evaluate your model with. Both RMSE and MAE are sensitive to the range in which you measure your target variables, if you are going to scale down your target variable then for sure the values of RMSE and MAE will go down, lets take an example to illustrate that.
def rmse(y_true, y_pred):
return np.sqrt(np.mean(np.square(y_true - y_pred)))
def mae(y_true, y_pred):
return np.mean(np.abs(y_true - y_pred))
I have written two functions for computing both RMSE and MAE. Now lets plug in some values and see what happens,
y_true = np.array([2,5,9,7,10,-5,-2,2])
y_pred = np.array([3,4,7,9,8,-3,-2,1])
For the time being let's assume that the true and the predicted vales are as shown above. Now we are ready to compute RMSE and MAE for this data.
rmse(y_true,y_pred)
1.541103500742244
mae(y_true, y_pred)
1.375
Now let's scale down our target variable by a factor of 10 and compute the same measure again.
y_scaled_true = np.array([2,5,9,7,10,-5,-2,2])/10
y_scaled_pred = np.array([3,4,7,9,8,-3,-2,1])/10
rmse(y_scaled_true,y_scaled_pred)
0.15411035007422444
mae(y_scaled_true,y_scaled_pred)
0.1375
We can now very well see that just by scaling our target variable our RMSE and MAE scores have dropped creating an illusion that our model has improved, but actually NOT. When we scale back our model's predictions we are into the same state.
So coming to the point, MAPE (Mean Absolute Percentage Error) could be a better way to measure your performance of the model and it is insensitive to the scale in which the variables are measure. If you compute MAPE for both the sets of values we see that they are same,
def mape(y, y_pred):
return np.mean(np.abs((y - y_pred)/y))
mape(y_true,y_pred)
0.28849206349206347
mape(y_scaled_true,y_scaled_pred)
0.2884920634920635
So it is better to rely on MAPE over MAE or RMSE, if you want your performance measure to be independent on the scale in which they are measured.
Answering your second question, since you are dealing with some complicated models like MLPRegressor and ForestRegression which has some hyper-parameters which needs to be tuned to avoid over fitting, the best way to find the ideal levels of the hyper-parameters is to divide the data into train, test and validation and use techniques like K-Fold Cross Validation to find the optimal setting. It is quite difficult to say if the above values are acceptable or not just by looking at this one case.
Hope this helps!
来源:https://stackoverflow.com/questions/56689243/is-it-acceptable-to-scale-target-values-for-regressors