Replace missing values in categorical data

孤独总比滥情好 2021-01-27 00:00

Let's suppose I have a column with categorical data ("red", "green", "blue") and some empty cells:

red
green
red
blue
NaN

I'm sure that the NaN b

3 Answers
  • 2021-01-27 00:28

    In addition to the approach in Lan's answer, which seems to be the most commonly used, you can use something based on matrix factorization. For example, there is a variant of Generalized Low Rank Models (GLRMs) that can impute such data, just as probabilistic matrix factorization is used to impute continuous data.

    GLRMs are available in H2O, which provides bindings for both Python and R.
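
    As a rough illustration of the matrix-factorization idea, here is a minimal NumPy sketch. It is not the H2O GLRM API and not a full GLRM: it one-hot encodes the table and iteratively fills the unknown cells from a truncated SVD, which is a much simpler stand-in. The function name lowrank_impute_categorical, the second column (a fruit) and the toy rows are all hypothetical, and the approach only makes sense when at least one other column correlates with the one being imputed.

```python
import numpy as np

def lowrank_impute_categorical(rows, rank=2, n_iter=50):
    """Fill None entries in a table of categorical values (list of equal-length rows).

    Sketch only: one-hot encode, start unknown cells at the column means,
    then repeatedly overwrite them with a rank-`rank` SVD reconstruction.
    """
    n_cols = len(rows[0])
    # Sorted category list per column, ignoring missing entries.
    cats = [sorted({r[j] for r in rows if r[j] is not None}) for j in range(n_cols)]
    offsets = np.cumsum([0] + [len(c) for c in cats])

    X = np.zeros((len(rows), offsets[-1]))
    known = np.ones_like(X, dtype=bool)
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is None:
                # The whole one-hot block of a missing value is unknown.
                known[i, offsets[j]:offsets[j + 1]] = False
            else:
                X[i, offsets[j] + cats[j].index(v)] = 1.0

    # Start unknown cells at the observed category frequencies.
    col_means = np.nanmean(np.where(known, X, np.nan), axis=0)
    X = np.where(known, X, col_means)

    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(known, X, approx)  # only unknown cells are updated

    # Decode: pick the highest-scoring category in each missing block.
    filled = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is None:
                block = X[i, offsets[j]:offsets[j + 1]]
                filled[i][j] = cats[j][int(np.argmax(block))]
    return filled

# Hypothetical toy table: (color, fruit); the last row has a missing color.
rows = [["red", "apple"]] * 3 + [["green", "pear"]] * 3 + [[None, "apple"]]
filled = lowrank_impute_categorical(rows)
```

    On this toy table the missing color in the last row is filled with "red", because every other "apple" row is red. For real data, a proper GLRM implementation such as the one in H2O is what this answer has in mind.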

  • 2021-01-27 00:35

    It depends on what you want to do with the data. Is the average of these colors useful for your purpose? Averaging creates a new possible value, which is probably not what you want, especially since this is categorical data and averaging treats it as if it were numeric.

    In machine learning, a common approach is to replace missing values with the most frequent category conditional on the target attribute (the thing you want to predict).

    Example: You want to predict whether a person is male or female by looking at their car, and the color feature has some missing values. If most cars driven by men are blue and most cars driven by women are red, you would fill the missing entries with blue for male drivers and red for female drivers.
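
    The car example can be sketched with pandas as follows. This is only an illustration: the column names gender and color, the toy rows and the helper fill_with_group_mode are assumptions made up for the example, not part of any library.

```python
import pandas as pd

# Hypothetical toy data: gender is the target, color has missing entries.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female"],
    "color":  ["blue", "blue", None, "red", None, "red"],
})

def fill_with_group_mode(frame, column, target):
    """Fill missing `column` values with the most frequent value within each `target` group."""
    # Most frequent observed value per group; mode() ignores missing values.
    # Note: this raises IndexError if a group has no observed value at all.
    group_mode = frame.groupby(target)[column].agg(lambda s: s.mode().iloc[0])
    # Look up each row's group mode and use it only where the value is missing.
    return frame[column].fillna(frame[target].map(group_mode))

df["color"] = fill_with_group_mode(df, "color", "gender")
```

    Because the fill value depends on the target, this can only be applied to rows where the target is known, that is, to training data.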

  • 2021-01-27 00:44

    The simplest strategy for handling missing data is to remove records that contain a missing value.
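
    With pandas, dropping incomplete records could look like this minimal sketch. The DataFrame and its color column are assumed here and simply mirror the data in the question.

```python
import pandas as pd

# Column from the question: four observed colors and one missing cell.
df = pd.DataFrame({"color": ["red", "green", "red", "blue", None]})

# Keep only the rows where `color` is present; df itself is left unchanged.
complete = df.dropna(subset=["color"])
```

    This keeps four of the five rows. It is only reasonable when few records are affected, since every dropped row also discards its other columns.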

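    The scikit-learn library provides the SimpleImputer class (in sklearn.impute), which can be used to replace missing values. It supersedes the older sklearn.preprocessing.Imputer, which was deprecated in version 0.20 and removed in 0.22. Since this is categorical data, using the mean as the replacement value is not recommended. You can use the most frequent value instead:

    import numpy as np
    from sklearn.impute import SimpleImputer
    # Replace each NaN with the most frequent value in its column
    imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

    SimpleImputer accepts a DataFrame as input, but by default it returns a NumPy array rather than a DataFrame.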
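
    Applied to the column from the question, most-frequent imputation might look like the sketch below. It assumes a recent scikit-learn in which the imputer class is SimpleImputer from sklearn.impute; the DataFrame and the column name color are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column from the question: "red" is the most frequent observed value.
df = pd.DataFrame({"color": ["red", "green", "red", "blue", np.nan]})

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform expects 2-D input, hence the double brackets,
# and returns a 2-D NumPy array rather than a DataFrame.
filled = imp.fit_transform(df[["color"]])

# Flatten the single column and write it back into the DataFrame.
df["color"] = filled.ravel()
```

    Here the NaN is replaced by "red". On real data, fit the imputer on the training set only and reuse it with transform on new data, so that the fill value does not depend on the test set.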

    Last but not least, not all ML algorithms fail on missing values: some can handle them natively, and different implementations of the same algorithm also differ in this respect.
