Dummy variables when not all categories are present

问题

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus create the appropriate number of dummy variables. However, in the problem I'm working right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

categories = ['a', 'b', 'c']

   cat
1   a
2   b
3   a

Become this:

  cat_a  cat_b  cat_c
1   1      0      0
2   0      1      0
3   1      0      0

回答1:

Using transpose and reindex

import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)

print dummies

    a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0

回答2:

TL;DR: pd.get_dummies(cat.astype('category', categories=categories)

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:

In [1]: import pandas as pd

In [2]: possible_categories = list('abc')

In [3]: cat = pd.Series(list('aba'))

In [4]: cat = cat.astype('category', categories=possible_categories)

In [5]: cat
Out[5]: 
0    a
1    b
2    a
dtype: category
Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

In [6]: pd.get_dummies(cat)
Out[6]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0

There are a bunch of other ways to create a categorical Series or DataFrame, this is just the one I find most convenient. You can read about all of them in the pandas documentation.

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.

回答3:

Try this:

In[1]: import pandas as pd
       cats = ["a", "b", "c"]

In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})

In[3]: pd.concat((pd.get_dummies(df.cat, columns=cats), pd.DataFrame(columns=cats))).fillna(0)
Out[3]: 
     a    b    c
0  1.0  0.0  0
1  0.0  1.0  0
2  1.0  0.0  0

回答4:

I don't think get_dummies provides this out of the box, it only allows for creating an extra column that highlights NaN values.

To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id) and automatically create any missing columns, use fillna(0) to replace missing values, and then use .groupby('id') to separate the various DataFrame again.

回答5:

I did ask this on the pandas github. Turns out it is really easy to get around it when you define the column as a Categorical where you define all the possible categories.

df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

get_dummies() will do the rest then as expected.

回答6:

Adding the missing category in the test set:

# Get missing columns in the training test
missing_cols = set( train.columns ) - set( test.columns )
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    test[c] = 0
# Ensure the order of column in the test set is in the same order than in train set
test = test[train.columns]

Notice that this code also remove column resulting from category in the test dataset but not present in the training dataset

回答7:

As suggested by others - Converting your Categorical features to 'category' data type should resolve the unseen label issue using 'get_dummies'.

# Your Data frame(df)
from sklearn.model_selection import train_test_split
X = df.loc[:,df.columns !='label']
Y = df.loc[:,df.columns =='label']

# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3) 

# Convert Categorical Columns in your data frame to type 'category'
for col in df.select_dtypes(include=[np.object]).columns:
    X_train[col] = X_train[col].astype('category', categories = df[col].unique())
    X_test[col] = X_test[col].astype('category', categories = df[col].unique())

# Now, use get_dummies on training, test data and we will get same set of columns
X_train = pd.get_dummies(X_train,columns = ["Categorical_Columns"])
X_test = pd.get_dummies(X_test,columns = ["Categorical_Columns"])

来源：https://stackoverflow.com/questions/37425961/dummy-variables-when-not-all-categories-are-present

标签

python

pandas

machine-learning

dummy-variable