I'm getting this error when trying to predict using a model I built in scikit-learn. I know that there are a bunch of questions about this, but mine seems different from the
I tried the method suggested here and ended up one-hot encoding the label column as well, so in the dataframe it shows up as 'label_train' and 'label_test'. As a heads-up, try this after get_dummies:
train_df = feature_df[feature_df['label_train'] == 1]
test_df = feature_df[feature_df['label_test'] == 1]
train_df = train_df.drop(['label_train', 'label_test'], axis=1)
test_df = test_df.drop(['label_train', 'label_test'], axis=1)
You can utilize the Categorical Dtype to apply null values to unseen data.
Input:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
# Create Example Data
train = pd.DataFrame({"text":["A", "B", "C", "D", 'F', np.nan]})
test = pd.DataFrame({"text":["D", "D", np.nan,"B", "E", "T"]})
# Convert columns to category dtype and specify categories for test set
train['text'] = train['text'].astype('category')
test['text'] = test['text'].astype(CategoricalDtype(categories=train['text'].cat.categories))
# Create Dummies
pd.get_dummies(test['text'], dummy_na=True)
Output:
| A | B | C | D | F | nan |
|---|---|---|---|---|-----|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 1 |
The correction below to the original answer from Scratch'N'Purr should help with issues you might hit when using a string as the value of the newly inserted 'label' column. It uses integers instead:
import pandas as pd
train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 1
score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 2
# Concat
concat_df = pd.concat([train_df , score_df])
# Create your dummies
features_df = pd.get_dummies(concat_df)
# Split your data
train_df = features_df[features_df['label'] == 1]
score_df = features_df[features_df['label'] == 2]
...
The reason you're getting the error is that the features you're generating dummy values for with get_dummies have different sets of distinct values in your two datasets.
Let's suppose the Word_1 column in your training set has the following distinct words: the, dog, jumps, roof, off. That's 5 distinct words, so pandas will generate 5 features for Word_1. Now, if your scoring dataset has a different number of distinct words in the Word_1 column, then you're going to get a different number of features.
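To make the mismatch concrete, here is a minimal sketch. The two small frames and the words in the scoring frame are made up for illustration; they are not the asker's CSV files.

```python
import pandas as pd

# Hypothetical toy data standing in for the training and scoring CSVs
train = pd.DataFrame({"Word_1": ["the", "dog", "jumps", "roof", "off"]})
score = pd.DataFrame({"Word_1": ["the", "cat", "sleeps"]})

# Encoding each frame separately yields one column per distinct value seen
train_dummies = pd.get_dummies(train)
score_dummies = pd.get_dummies(score)

print(train_dummies.shape[1])  # 5 feature columns
print(score_dummies.shape[1])  # 3 feature columns

# Columns that exist in only one of the two frames
print(sorted(set(train_dummies.columns) ^ set(score_dummies.columns)))
```

A model fitted on the 5-column frame cannot predict on the 3-column one, which is what the error is reporting.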
How to fix:
You'll want to concatenate your training and scoring datasets using concat, apply get_dummies, and then split your datasets. That'll ensure you have captured all the distinct values in your columns. Given that you're using two different CSVs, you probably want to generate a column that specifies your training vs scoring dataset.
Example solution:
import pandas as pd
train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 'train'
score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 'score'
# Concat
concat_df = pd.concat([train_df , score_df])
# Columns to encode: Overall_Sentiment plus Word_1 through Word_43 (each listed once)
dummy_cols = ['Overall_Sentiment'] + ['Word_{}'.format(i) for i in range(1, 44)]
# Create your dummies
features_df = pd.get_dummies(concat_df, columns=dummy_cols, dummy_na=True)
# Split your data
train_df = features_df[features_df['label'] == 'train']
score_df = features_df[features_df['label'] == 'score']
# Drop your labels
train_df = train_df.drop('label', axis=1)
score_df = score_df.drop('label', axis=1)
# Now delete your 'slope' feature, create your features matrix, and create your model as you have already shown in your example
...
The number of feature columns in the training data you fit the model on (excluding the labels) must be the same as the number of feature columns in the data you are going to predict on.
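One way to check and enforce this before calling predict is to align the scoring columns to the training columns with DataFrame.reindex. Note that this is a different technique from the concat approach above, and the toy frames and the names X_train and X_score are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real data would come from the two CSV files
train = pd.DataFrame({"Word_1": ["the", "dog", "jumps"]})
score = pd.DataFrame({"Word_1": ["the", "cat"]})

# dtype=int keeps the dummy columns as 0/1 integers
X_train = pd.get_dummies(train, dtype=int)
X_score = pd.get_dummies(score, dtype=int)

# Keep exactly the training columns, in the same order:
# categories unseen in training are dropped, missing ones are filled with 0
X_score = X_score.reindex(columns=X_train.columns, fill_value=0)

# Both matrices now have the same number of feature columns
print(X_train.shape[1], X_score.shape[1])
```

With this approach a word that appears only in the scoring data (here "cat") ends up as an all-zero row, so whether that is acceptable depends on the model.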