Question
I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, this is what is clear to me:
Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black
is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new
is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Ordered numeric codes for ordinal data
Categorical data to numeric:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
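The one-hot table above can be produced directly with pandas (a minimal sketch using `pd.get_dummies`; `astype(int)` is only there because newer pandas versions emit booleans by default):

```python
import pandas as pd

data = {'color': ['blue', 'green', 'green', 'red']}
df = pd.DataFrame(data)

# One indicator column per category: color_blue, color_green, color_red
one_hot = pd.get_dummies(df['color'], prefix='color').astype(int)
print(one_hot)
```

Each row contains exactly one 1, marking that row's color.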
Ordinal data to numeric:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after applying the mapping old < renovated < new → 0 < 1 < 2:
0 0
1 2
2 2
3 1
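The ordinal column above can be obtained with an explicit mapping dictionary (a minimal sketch; the dictionary encodes the ordering old < renovated < new as 0 < 1 < 2):

```python
import pandas as pd

data = {'con': ['old', 'new', 'new', 'renovated']}
df = pd.DataFrame(data)

# Map each category to its rank in the ordering
order = {'old': 0, 'renovated': 1, 'new': 2}
df['con_encoded'] = df['con'].map(order)
print(df['con_encoded'].tolist())  # [0, 2, 2, 1]
```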
In my data I have a 'color' feature. As the color changes from white to black, the price increases. By the rules above I should probably use one-hot encoding for the categorical 'color' data. But why can't I use an ordinal representation? Below are the observations my question arose from.
Let me start by introducing the hypothesis for linear regression: Price = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn
Let's have a look at the data representations for color (white, red, black). With one-hot encoding, item 1 (white) is [1, 0, 0] and item 2 (red) is [0, 1, 0]; with ordinal encoding, white, red and black are assigned the codes 10, 20 and 30.
Let's predict the price of the 1st and 2nd items using the formula for both data representations:
One-hot encoding:
In this case a different theta exists for each color. I assume the thetas were already derived by the regression (20, 50 and 100). The predictions will be:
Price (item 1) = 0 + 20*1 + 50*0 + 100*0 = 20$ (thetas assumed for the example)
Price (item 2) = 0 + 20*0 + 50*1 + 100*0 = 50$
Ordinal encoding for color: in this case all colors share one common theta, but my assigned codes (10, 20, 30) differ:
Price (item 1) = 0 + 20*10 = 200$ (theta assumed for the example)
Price (item 2) = 0 + 20*20 = 400$
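The two hand-computed predictions above can be checked in a few lines (the thetas 20/50/100, the shared theta 20, and the codes 10/20/30 are the question's assumed example values, not fitted ones):

```python
# One-hot: a separate theta per color column
thetas = [20, 50, 100]
item1 = [1, 0, 0]   # white
item2 = [0, 1, 0]   # red
price1 = sum(t * x for t, x in zip(thetas, item1))
price2 = sum(t * x for t, x in zip(thetas, item2))
print(price1, price2)  # 20 50

# Ordinal: one shared theta multiplied by the assigned color code
theta = 20
print(theta * 10, theta * 20)  # 200 400
```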
In my model, white < red < black in price. It seems the correlation works correctly and the predictions are logical in both cases, for both the ordinal and the categorical representation. So can I use any encoding for my regression regardless of the data type (categorical or ordinal)? Is this division in data representations just a matter of conventions and software-oriented representations rather than a matter of the regression logic itself?
Answer 1:
So I can use any encoding for my regression regardless of the data type (categorical or ordinal)? This division in data representations is just a matter of conventions and software-oriented representations rather than a matter of regression logic itself?
You can do anything. The question is what will probably work better. And the answer is: you should use the representation that embeds correct information about the data structure and does not embed false assumptions. What does that mean here?
- If your data is categorical and you use a numeric format, you embed a false structure (there is no ordering of categorical data).
- If your data is ordinal and you use one-hot encoding, you fail to embed the true structure (there is an ordering and you ignore it).
So why do both formats "work" in your case? Because your problem is trivial and, in fact, incorrectly stated. You analyze how well the training samples are predicted, and with a sufficiently overfitted model you will always get a perfect score on training data, no matter what the representation is. What you have actually shown is that there exists a theta which makes things right. And yes, if there exists a theta (in linear models) which works for the ordinal representation, there will always be one for the one-hot representation. The catch is that you are much more likely to miss it while training your model. It is not a software-oriented problem, it is a learning-oriented problem.
In practice, however, this would not happen. Once you introduce an actual problem, with lots of data that may be noisy, uncertain, etc., you will get better scores with less effort using a representation that reflects the nature of the problem (here: ordinal) than using one that does not (here: one-hot). Why? Because the knowledge of being ordinal can be inferred (learned) from the data by the model, but you will need much more training data to do so. So why do that if you can embed this information directly into the data structure, leading to an easier learning problem? Learning in ML is hard enough; do not make it even harder. On the other hand, always remember that you have to be sure the knowledge you embed is indeed true, because it may be hard to learn a relation from the data, but it is even harder to learn real patterns from false relations.
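The effect of embedding the ordering can be sketched on synthetic data (a hedged toy experiment, not from the original thread: the data, seed, and tiny training set are all made up, and ordinal will not beat one-hot on every seed; it merely tends to when data is scarce, because one-hot fits three free coefficients while ordinal fits one):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ordinal feature: condition 0 < 1 < 2; price rises with it plus noise
cond = rng.integers(0, 3, size=200)
price = 100 + 50 * cond + rng.normal(0, 20, size=200)

X_ord = cond.reshape(-1, 1).astype(float)  # ordinal: the ordering is built in
X_hot = np.eye(3)[cond]                    # one-hot: the ordering must be re-learned

def fit_predict(X_train, y_train, X_test):
    # Ordinary least squares with an intercept column
    A = np.hstack([np.ones((len(X_train), 1)), X_train])
    theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.hstack([np.ones((len(X_test), 1)), X_test]) @ theta

train, test = slice(0, 9), slice(9, None)  # deliberately tiny training set
for name, X in [('ordinal', X_ord), ('one-hot', X_hot)]:
    pred = fit_predict(X[train], price[train], X[test])
    mse = float(np.mean((pred - price[test]) ** 2))
    print(name, round(mse, 1))
```

With so few training rows, one-hot may not even see every category, while the ordinal model extrapolates from the ordering it was given.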
Source: https://stackoverflow.com/questions/34087329/categorical-and-ordinal-feature-data-representation-in-regression-analysis