Getting ValueError: y contains new labels when using scikit learn's LabelEncoder

别等时光非礼了梦想. 提交于 2019-12-22 08:10:25

问题


I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID) 

But, now for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id i.e., if same values are present then transform it according to the above label encoder, otherwise assign a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_dpid.transform(new_df.ID)

But, I'm getting the following error: ValueError: y contains new labels

How do I fix this?? Thanks!

UPDATE:

So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn the characteristics where a 'High' is given, where a 'Low' is given from the training dataset. For example, below a 'High' is given when there are multiple entries with same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum   |  ID | 

00982222  | AB999 | 
00982222  | AB999 |
00981111  | AB890 |

I'm doing something like this:

df['BankNum'] = df.BankNum.astype(np.float128)

    le_id = LabelEncoder()
    df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID'], df.Labels, test_size=0.25, random_state=42)
    clf = RandomForestClassifier(random_state=42, n_estimators=140)
    clf.fit(X_train, y_train)

回答1:


I think the error message is very clear: Your test dataset contains ID labels which have not been included in your training data set. For this items, the LabelEncoder can not find a suitable numeric value to represent. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure that each label is not only present in your test but also in your training data. Otherwise, you can try to follow one of the ideas presented here.

One of the possibles solutions is, that you search through your data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.

An other possible solution is, to check that the test data have only labels which have been seen in the training process. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doin this, you put all new, unknown IDs in one class; for this items the prediction will then fail, but you can use the rest of your code as it is now.




回答2:


you can try solution from "sklearn.LabelEncoder with never seen before values" https://stackoverflow.com/a/48169252/9043549 The thing is to create dictionary with classes, than map column and fill new classes with some "known value"

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
suf="_le"
col="a"
df[col+suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col='b'
df[col+suf]=df[col].map(dic).fillna(dic["c"]).astype(int) 


来源:https://stackoverflow.com/questions/46288517/getting-valueerror-y-contains-new-labels-when-using-scikit-learns-labelencoder

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!