ValueError: Number of features of the model must match the input

谎友^ 2020-12-14 12:11

I'm getting this error when trying to predict using a model I built in scikit-learn. I know that there are a bunch of questions about this but mine seems different from the

5 answers
  • 2020-12-14 12:54

    I tried the method suggested here and ended up one-hot encoding the label column as well, so in the dataframe it shows up as 'label_test' and 'label_train'. Just a heads up: after get_dummies, split like this:

    train_df = feature_df[feature_df['label_train'] == 1]
    test_df = feature_df[feature_df['label_test'] == 1]
    train_df = train_df.drop(['label_train', 'label_test'], axis=1)
    test_df = test_df.drop(['label_train', 'label_test'], axis=1)
    
  • 2020-12-14 12:57

    You can use CategoricalDtype so that values not seen in the training data become nulls in the test data.

    Input:

    import pandas as pd
    import numpy as np
    from pandas.api.types import CategoricalDtype
    
    # Create Example Data
    train = pd.DataFrame({"text":["A", "B", "C", "D", 'F', np.nan]})
    test = pd.DataFrame({"text":["D", "D", np.nan,"B", "E", "T"]})
    
    # Convert columns to category dtype and specify categories for test set
    train['text'] = train['text'].astype('category')
    test['text'] = test['text'].astype(CategoricalDtype(categories=train['text'].cat.categories))
    
    # Create Dummies
    pd.get_dummies(test['text'], dummy_na=True)
    

    Output:

    | A | B | C | D | F | nan |
    |---|---|---|---|---|-----|
    | 0 | 0 | 0 | 1 | 0 | 0   |
    | 0 | 0 | 0 | 1 | 0 | 0   |
    | 0 | 0 | 0 | 0 | 0 | 1   |
    | 0 | 1 | 0 | 0 | 0 | 0   |
    | 0 | 0 | 0 | 0 | 0 | 1   |
    | 0 | 0 | 0 | 0 | 0 | 1   |
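As a quick check of why this helps with the original error, here is a minimal sketch on the same toy data. It only assumes the `train` and `test` frames from the answer above; the variable names `train_dummies`, `test_dummies`, `same_columns`, and `n_unseen` are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same toy data as in the answer above
train = pd.DataFrame({"text": ["A", "B", "C", "D", "F", np.nan]})
test = pd.DataFrame({"text": ["D", "D", np.nan, "B", "E", "T"]})

# Fix the category set from the training data and reuse it for the test data
train["text"] = train["text"].astype("category")
train_dtype = CategoricalDtype(categories=train["text"].cat.categories)
test["text"] = test["text"].astype(train_dtype)

# Values unseen in training ("E", "T") are now null, alongside the original NaN
n_unseen = int(test["text"].isna().sum())

train_dummies = pd.get_dummies(train["text"], dummy_na=True)
test_dummies = pd.get_dummies(test["text"], dummy_na=True)

# Compare column names as strings so the NaN column compares cleanly
same_columns = [str(c) for c in train_dummies.columns] == [str(c) for c in test_dummies.columns]
print(same_columns, train_dummies.shape[1], test_dummies.shape[1])
```

Because both frames share one category set, the two dummy matrices have the same number of columns, which is what the model checks at predict time.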
    
  • 2020-12-14 13:02

    The correction below to the original answer from Scratch'N'Purr avoids a problem you may hit when using a string as the value of the newly inserted 'label' column: get_dummies without a columns argument also encodes string columns, so 'label' itself is turned into dummies. Integer labels are left untouched:

    import pandas as pd
    
    train_df = pd.read_csv("Cinderella.csv")
    train_df['label'] = 1
    
    score_df = pd.read_csv("Slaughterhouse_copy.csv")
    score_df['label'] = 2
    
    # Concat
    concat_df = pd.concat([train_df, score_df])
    
    # Create your dummies
    features_df = pd.get_dummies(concat_df)
    
    # Split your data (compare against the integer labels, not strings)
    train_df = features_df[features_df['label'] == 1]
    score_df = features_df[features_df['label'] == 2]
    ...
    
  • 2020-12-14 13:12

    You're getting the error because get_dummies creates one dummy column per distinct value in each feature, and your training and scoring datasets contain different distinct values.

    Let's suppose the Word_1 column in your training set has the following distinct words: the, dog, jumps, roof, off. That's 5 distinct words so pandas will generate 5 features for Word_1. Now, if your scoring dataset has a different number of distinct words in the Word_1 column, then you're going to get a different number of features.

    How to fix:

    You'll want to concatenate your training and scoring datasets using concat, apply get_dummies, and then split your datasets. That ensures you have captured all the distinct values in your columns. Given that you're using two different CSV files, you probably want to add a column that marks each row as training or scoring data.

    Example solution:

    import pandas as pd
    
    train_df = pd.read_csv("Cinderella.csv")
    train_df['label'] = 'train'
    
    score_df = pd.read_csv("Slaughterhouse_copy.csv")
    score_df['label'] = 'score'
    
    # Concat
    concat_df = pd.concat([train_df , score_df])
    
    # Create your dummies ('Overall_Sentiment' plus 'Word_1' through 'Word_43')
    dummy_columns = ['Overall_Sentiment'] + ['Word_%d' % i for i in range(1, 44)]
    features_df = pd.get_dummies(concat_df, columns=dummy_columns, dummy_na=True)
    
    # Split your data
    train_df = features_df[features_df['label'] == 'train']
    score_df = features_df[features_df['label'] == 'score']
    
    # Drop your labels
    train_df = train_df.drop('label', axis=1)
    score_df = score_df.drop('label', axis=1)
    
    # Now delete your 'slope' feature, create your features matrix, and create your model as you have already shown in your example
    ...
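The explanation and fix above can be tried on toy data. This is a minimal sketch, not the asker's actual data: the two small frames below are hypothetical stand-ins for the CSV files, with a single `Word_1` column and the `slope` column.

```python
import pandas as pd

# Hypothetical toy data standing in for the two CSV files
train_df = pd.DataFrame({"Word_1": ["the", "dog", "jumps", "roof", "off"],
                         "slope": [1, 0, 1, 0, 1]})
score_df = pd.DataFrame({"Word_1": ["the", "cat", "sleeps"],
                         "slope": [0, 1, 0]})

# Encoding each dataset on its own gives different numbers of columns
separate_train = pd.get_dummies(train_df, columns=["Word_1"])
separate_score = pd.get_dummies(score_df, columns=["Word_1"])
print(separate_train.shape[1], separate_score.shape[1])

# Tag the rows, concatenate, encode once, then split again
train_df["label"] = "train"
score_df["label"] = "score"
concat_df = pd.concat([train_df, score_df], ignore_index=True)
features_df = pd.get_dummies(concat_df, columns=["Word_1"])

train_enc = features_df[features_df["label"] == "train"].drop("label", axis=1)
score_enc = features_df[features_df["label"] == "score"].drop("label", axis=1)
print(train_enc.shape, score_enc.shape)
```

After the joint encoding both frames carry one dummy column for every word seen in either dataset, so their column counts agree.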
    
  • 2020-12-14 13:12

    The number of features (columns) in the training data you fit the model on, excluding the labels, must be the same as the number of features in the data you are going to predict on.
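One way to check this before calling predict is to compare the column sets directly. The sketch below is illustrative only: `feature_mismatch` is a hypothetical helper, the two frames are made-up encoded data, and the `reindex` step is one possible alignment that is not taken from the answers above (it silently drops categories the model never saw).

```python
import pandas as pd

def feature_mismatch(train_X, score_X):
    """Return (missing, extra): training columns absent from score_X,
    and score_X columns that the training data never had."""
    missing = [c for c in train_X.columns if c not in score_X.columns]
    extra = [c for c in score_X.columns if c not in train_X.columns]
    return missing, extra

# Hypothetical already-encoded feature matrices
train_X = pd.DataFrame({"Word_1_dog": [1, 0], "Word_1_the": [0, 1]})
score_X = pd.DataFrame({"Word_1_cat": [1], "Word_1_the": [0]})

missing, extra = feature_mismatch(train_X, score_X)
print(missing, extra)

# Possible alignment: keep exactly the training columns, filling absent ones with 0
aligned_X = score_X.reindex(columns=train_X.columns, fill_value=0)
```

If both lists come back empty and the column order matches, the feature counts agree and the ValueError should not be raised for this reason.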
