Question
I wrote the code below. X is a DataFrame with shape (1000, 5) and y is a DataFrame with shape (1000, 1). y is the target to predict, and it is imbalanced. I want to apply cross-validation and SMOTE.
def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits = n)
    acc_scores = []
    rec_scores = []
    f1_scores = []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
        X_test = X[test_index]
        y_test = y[test_index]
        est.fit(X_resampled, y_resampled)
        y_pred = est.predict(X_test)
        acc_scores.append(accuracy_score(y_test, y_pred))
        rec_scores.append(recall_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
    print('Accuracy:',np.mean(acc_scores))
    print('Recall:',np.mean(rec_scores))
    print('F1:',np.mean(f1_scores))

Learning(3, SGDClassifier(), X_train_s_pca, y_train)
When I run the code, I get the below error:
"None of [Int64Index([ 4231, 4235, 4246, 4250, 4255, 4295, 4317, 4344, 4381,\n 4387,\n ...\n 13122, 13123, 13124, 13125, 13126, 13127, 13128, 13129, 13130,\n 13131],\n dtype='int64', length=8754)] are in the [columns]"
Any help getting it to run would be appreciated.
Answer 1:
If you look carefully at the error stack trace (which is important, but which you did not include), you should see that the error comes from this line (and will also come from the other lines that index the same way):
X_train = X[train_index]
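This way of selecting rows only works for NumPy arrays. On a pandas DataFrame, X[train_index] is treated as a column selection, which is why the error says the indices are not "in the [columns]". Since you are using a pandas DataFrame, select rows with an indexer instead. The indices returned by StratifiedKFold.split are positions, so iloc is the safe choice (loc is label-based and gives the same rows only when the DataFrame has the default 0..n-1 index):
X_train = X.iloc[train_index]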
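As a small illustration of the indexing behaviour described here, the sketch below uses a hypothetical toy DataFrame (X_demo and positions are made-up names, not from the question) whose index labels deliberately differ from the row positions. It shows that plain bracket indexing with an integer array raises a KeyError, while positional selection with iloc, or indexing the underlying NumPy array, returns the expected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: index labels (100..103) differ from row positions (0..3).
X_demo = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]},
    index=[100, 101, 102, 103],
)
positions = np.array([0, 2])  # positional indices, like those yielded by split()

# Bracket indexing looks for *columns* named 0 and 2, so it raises KeyError.
try:
    X_demo[positions]
    bracket_failed = False
except KeyError:
    bracket_failed = True

rows_iloc = X_demo.iloc[positions]     # rows at positions 0 and 2
rows_numpy = X_demo.values[positions]  # the same rows as a plain NumPy array

print(bracket_failed)
print(rows_iloc)
print(rows_numpy)
```

Note that X_demo.loc[positions] would also fail on this frame, because the labels 0 and 2 do not exist in its index.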
Alternatively, to minimize code changes, you can convert the DataFrames to NumPy arrays by using values:
Learning(3, SGDClassifier(), X_train_s_pca.values, y_train.values)
Source: https://stackoverflow.com/questions/56149457/function-for-cross-validation-and-oversampling-smote