Patsy: New levels in categorical fields in test data

Submitted by 对着背影说爱祢 on 2019-12-23 07:48:41

Question


I am trying to use Patsy (with sklearn and pandas) to create a simple regression model. The R-style formula syntax is a major draw.

My data contains a field called 'ship_city', which can hold any city in India. Because I am partitioning the data into train and test sets, several cities appear in only one of the two sets. A code snippet is given below:

from patsy import dmatrices, build_design_matrices

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
# .builder is deprecated since patsy 0.4.0; the DesignInfo objects can be passed directly
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info, df_train_X_design_info], df_test, return_type='dataframe')

The last line throws the following error:

patsy.PatsyError: Error converting data to categorical: observation with value 'Kolkata' does not match any of the expected levels

I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.

Is there any way I can make this work with Patsy?


Answer 1:


The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.

One way is by using the levels= argument to C(...), like:

from patsy import dmatrices

# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach, if you only have the two splits:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

# The formula looks up all_cities in the calling scope.
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)

Another option, if you're using pandas's built-in categorical support, is to record the set of possible values when you set up your data frame. If patsy detects that the object you've passed it is a pandas categorical, it automatically uses the pandas categories attribute instead of trying to guess the possible categories from the data.
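A minimal sketch of the pandas categorical route might look like the following. The toy data frame, the 'y' column, and the positional split are assumptions made for illustration; only the 'ship_city' name comes from the question.

```python
import pandas as pd
from patsy import dmatrices, build_design_matrices

# Hypothetical toy data; 'Kolkata' ends up only in the test rows.
df_all = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ship_city": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Kolkata", "Delhi"],
})

# Record the complete set of categories before splitting.
all_cities = sorted(df_all["ship_city"].unique())
df_all["ship_city"] = pd.Categorical(df_all["ship_city"], categories=all_cities)

# Slices keep the full category list, even for levels they do not contain.
df_train = df_all.iloc[:4]
df_test = df_all.iloc[4:]

y_train, X_train = dmatrices("y ~ ship_city", data=df_train,
                             return_type="dataframe")
# Reuse the training design so the test matrix has the same columns.
(X_test,) = build_design_matrices([X_train.design_info], df_test,
                                  return_type="dataframe")

print(list(X_train.columns))
print(list(X_test.columns))
```

Note that a level absent from the training rows produces an all-zero column in the training design matrix, so a model fitted on it cannot learn anything about that level from the training data alone.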




Answer 2:


I ran into a similar problem and worked around it by building the design matrices before splitting the data.

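from patsy import dmatrices
from sklearn.model_selection import train_test_split

# Build the design matrices on the full data so every level is seen, then split.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
    train_test_split(df_X, df_Y, test_size=test_size)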

Then as an example of applying a fit:

import statsmodels.api as sm

# OLS takes the response and design matrices directly (endog, exog).
model = sm.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)

I haven't written a test case for this, but I haven't run into the "Error converting data to categorical" error again since making this change.



Source: https://stackoverflow.com/questions/34035912/patsy-new-levels-in-categorical-fields-in-test-data
