Getting 'ValueError: shapes not aligned' on SciKit Linear Regression


Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:

I have a training set and a test set of data, co

3 Answers
  • 2021-02-08 20:55

    This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.

    One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
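
    A minimal sketch of that idea (the 'letter' column, the allowed set, and the toy frames below are made up for illustration):

    import pandas as pd

    ALLOWED = {'A', 'B', 'C', 'D'}

    def limit_categories(value):
        # Map anything outside the allowed set into a single catch-all bucket
        return value if value in ALLOWED else 'Other/Unknown'

    # Toy frames standing in for the real train/test data
    train_data = pd.DataFrame({'letter': ['A', 'B', 'Z']})
    test_data = pd.DataFrame({'letter': ['C', 'Q']})

    # Apply the same mapping to both sets before calling get_dummies,
    # so both produce dummies from the same limited set of values
    train_data['letter'] = train_data['letter'].map(limit_categories)
    test_data['letter'] = test_data['letter'].map(limit_categories)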

    Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.

    The second approach is what you mention in your question, so I'll go through it with pandas.

    By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:

    test_encoded = pd.get_dummies(test_data, columns=['your columns'])
    test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns,
                                                  fill_value=0)
    

    This will encode the test data in the same way as your training data, filling in 0 for any dummy features that were created during training but not produced when encoding the test data.

    You could just wrap this into a function and apply it to your test data on the fly. You don't need to keep the encoded training data in memory (which I access above with training_encoded.columns) if you save the column names as a list.
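
    A minimal sketch of such a wrapper, assuming training_columns is a list of column names saved from the encoded training data (the function and variable names here are illustrative):

    import pandas as pd

    def encode_like_training(df, training_columns, categorical_cols):
        # One-hot encode, then force the result to have exactly the same
        # dummy columns (and column order) as the training data; dummies
        # that the test data didn't produce are filled with 0
        encoded = pd.get_dummies(df, columns=categorical_cols)
        return encoded.reindex(columns=training_columns, fill_value=0)

    # Saved once after encoding the training set:
    # training_columns = list(training_encoded.columns)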

  • 2021-02-08 20:59

    For anyone interested: I ended up merging the train and test sets, generating the dummies, and then splitting the data again at exactly the same fraction. That way there was no longer any issue with mismatched shapes, since both halves ended up with exactly the same dummy columns.
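
    A minimal sketch of that approach (the toy frames and the 'letter' column are made up; splitting back by the original lengths amounts to the same fraction):

    import pandas as pd

    # Toy frames standing in for the real train/test data
    train = pd.DataFrame({'letter': ['A', 'B', 'C'], 'y': [1, 0, 1]})
    test = pd.DataFrame({'letter': ['B', 'AA'], 'y': [0, 1]})

    # Concatenate, encode once so both parts share the same dummy columns,
    # then split back apart at the original boundary
    combined = pd.concat([train, test], ignore_index=True)
    combined_encoded = pd.get_dummies(combined, columns=['letter'])

    train_encoded = combined_encoded.iloc[:len(train)]
    test_encoded = combined_encoded.iloc[len(train):]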

  • 2021-02-08 21:06

    This worked for me. Initially, I was getting this error message:

    shapes (15754,3) and (4, ) not aligned 
    

    I found out that I was fitting the model with 3 variables in my training data, but when I call X_train = sm.add_constant(X_train) a constant column gets added automatically, so the fitted model actually has 4 parameters.
    At prediction time the test data by default still has only 3 columns, so the error pops up because of the dimension mismatch.
    The fix is to add the same constant column to X_test as well:

    `X_test = sm.add_constant(X_test)`
    

    The constant column itself carries no new information, but adding it resolves the whole issue.
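
    A minimal sketch of the full statsmodels flow, with made-up toy data, showing why both design matrices need the constant:

    import numpy as np
    import statsmodels.api as sm

    # Toy data standing in for the real 3-feature training set
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))
    y_train = X_train @ [1.0, 2.0, 3.0] + rng.normal(size=100)
    X_test = rng.normal(size=(20, 3))

    # add_constant prepends an intercept column, so the fitted model has 4 parameters
    X_train = sm.add_constant(X_train)
    model = sm.OLS(y_train, X_train).fit()

    # Without this line X_test has only 3 columns, and predict() raises
    # the "shapes not aligned" error against the 4 fitted parameters
    X_test = sm.add_constant(X_test)
    predictions = model.predict(X_test)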
