All probability values are less than 0.5 on unseen data

Question

I have 15 features and a binary response variable, and I am interested in predicting probabilities rather than 0/1 class labels. When I trained and tested an RF model with 500 trees, cross-validation, balanced class weights, and balanced samples in the data frame, I achieved good accuracy and a good Brier score. As you can see in the image, the predicted probabilities of class 1 on the test data range from 0 to 1.

Here is the histogram of predicted probabilities on test data:

Most values fall between 0 and 0.2 or between 0.9 and 1, which looks quite accurate. But when I predict probabilities for unseen data, that is, all data points whose 0/1 label is unknown, the predicted probabilities for class 1 lie only between 0 and 0.5. Why is that? Shouldn't the values also range from 0.5 to 1?

Here is the histogram of predicted probabilities on unseen data:

I am using sklearn's RandomForestClassifier in Python. The code is below:

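#Imports needed by the code below
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

#Read the CSV
df=pd.read_csv('path/df_all.csv')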

#Change the type of the variable as needed
df=df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif' : 'category'})

#Response variable is between 0 and 1 having actual probabilities values
y = df['probabilities']

# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,  # sample with replacement
                                 n_samples=100387,  # to match majority class
                                 random_state=42)  # reproducible results

# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])

y = df1['probabilities']
X = df1.iloc[:,1:138]

#Change integer values to category
y_01=y.astype('category')

#Split training and testing
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size = 0.30, random_state = 42,stratify=y)

#Model

model=RandomForestClassifier(n_estimators = 500,
                           max_features= 'sqrt',
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state=0,class_weight='balanced',)
#I had 137 variables; to select the optimal subset, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)

#Retrained the model with only 15 variables selected
rf=RandomForestClassifier(n_estimators = 500,
                           max_features= 'sqrt',
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state=0,class_weight='balanced',)

#X1_train is the same data frame as X_train, but with only the 15 selected variables
rf.fit(X1_train,y_train)

#Printed ROC metric
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid,rf.predict(X1_valid)))

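#Predicted probabilities on test data
predv=rf.predict_proba(X1_valid)
predv = predv[:, 1]
#predt was not defined in the original post; assumed here to be the class-1 probabilities on the training data
predt = rf.predict_proba(X1_train)[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))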

#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487

#Later, I built a data frame (sample_img) from images of those 15 variables and used the same model to predict probabilities.

IMG_pred=rf.predict_proba(sample_img)
IMG_pred=IMG_pred[:,1]

Answer 1:

The results shown for your test data are not valid: your procedure contains a mistake with two serious consequences, each of which invalidates them.

The mistake is that you upsample the minority class before splitting into training and test sets. You should instead split first, and then upsample only the training data, leaving the test data untouched.

The first reason why this procedure is invalid is that some of the duplicates created by upsampling end up in both the training and the test splits. The algorithm is then tested on samples it has already seen during training, which violates the fundamental requirement of a test set. For more details, see my own answer in "Process for oversampling data for imbalanced binary classification"; quoting from there:

I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
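The leakage described above can be made visible on a toy example. This is only a sketch under assumptions: the data frame, the `row_id` column, and the class sizes are made up for illustration and are not the asker's data. It repeats the question's order of operations (upsample, then split) and counts how many distinct minority rows in the test split also occur in the training split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 90 majority rows, 10 minority rows.
# row_id identifies the original (pre-upsampling) row.
df = pd.DataFrame({"row_id": range(100), "label": [0] * 90 + [1] * 10})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Same order as in the question: upsample first ...
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

# ... and only then split into training and test sets.
train, test = train_test_split(balanced, test_size=0.30,
                               random_state=42, stratify=balanced["label"])

# Distinct original minority rows present in each split.
train_minority_ids = set(train.loc[train["label"] == 1, "row_id"])
test_minority_ids = set(test.loc[test["label"] == 1, "row_id"])
leaked = test_minority_ids & train_minority_ids

print(f"{len(leaked)} of {len(test_minority_ids)} distinct minority rows "
      "in the test split also appear in the training split")
```

Any non-zero count means the model is partly being scored on rows it was trained on.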

The second reason is that this procedure reports biased performance measures on a test set that is no longer representative of reality. Remember, we want the test set to be representative of the real unseen data, which will of course be imbalanced. Artificially balancing the test set and claiming X% accuracy, when a large part of that accuracy is due to the artificially upsampled minority class, makes no sense and gives a misleading impression. For details, see my own answer in "Balance classes in cross validation" (the rationale is identical for a train-test split, as here).
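To illustrate how the test set stops resembling real data, the following sketch compares the minority share of the test split under both orders of operations. The 10% minority rate and the frame itself are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical data with a 10% minority class.
df = pd.DataFrame({"label": [0] * 900 + [1] * 100})

# Order used in the question: upsample, then split.
minority_up = resample(df[df["label"] == 1], replace=True,
                       n_samples=900, random_state=42)
balanced = pd.concat([df[df["label"] == 0], minority_up])
_, test_balanced = train_test_split(balanced, test_size=0.30,
                                    random_state=42,
                                    stratify=balanced["label"])

# Split on the original data: the test set keeps the natural imbalance.
_, test_natural = train_test_split(df, test_size=0.30,
                                   random_state=42, stratify=df["label"])

share_balanced = test_balanced["label"].mean()
share_natural = test_natural["label"].mean()
print("minority share, test set after upsampling:", share_balanced)
print("minority share, test set from original data:", share_natural)
```

Metrics computed on the first test set describe a population in which half the cases are positive, which is not the population the model will meet on unseen data.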

Because of this second reason, your procedure would still be wrong even if you had avoided the first mistake and had instead upsampled the training and test sets separately after splitting.

In short, you should fix the procedure so that you first split into training and test sets, and then upsample the training set only.
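A minimal sketch of the corrected order is given here, assuming synthetic data from `make_classification` in place of the asker's CSV; the column names `f0`..`f14` and `label`, the sample size, and the 100-tree forest are illustrative choices, not taken from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: stands in for the real data frame).
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(15)])
df["label"] = y

# 1) Split first, keeping the natural class ratio in both parts.
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42,
                                     stratify=df["label"])

# 2) Upsample the minority class in the training part only.
train_majority = train_df[train_df["label"] == 0]
train_minority = train_df[train_df["label"] == 1]
train_minority_up = resample(train_minority, replace=True,
                             n_samples=len(train_majority), random_state=42)
train_balanced = pd.concat([train_majority, train_minority_up])

X_train = train_balanced.drop(columns="label")
y_train = train_balanced["label"]
X_test = test_df.drop(columns="label")   # untouched, still imbalanced
y_test = test_df["label"]

# 3) Fit on the balanced training set, evaluate on the untouched test set.
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
proba_test = rf.predict_proba(X_test)[:, 1]
print("brier_score_testing:", brier_score_loss(y_test, proba_test))
```

Because the test rows are set aside before any resampling, no upsampled duplicate can reach them, and the reported Brier score refers to data with the original class balance.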

Source: https://stackoverflow.com/questions/61796112/all-probability-values-are-less-than-0-5-on-unseen-data

Tags

python

machine-learning

scikit-learn

random-forest

imbalanced-data