Specify list of possible values for Pandas get_dummies

Asked by 半阙折子戏 on 2021-02-14 17:36

Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:

    data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
            'categorical_1':['A', 'B', 'C', 'B', 'B']}
    frame = pd.DataFrame(data)
    dummies = pd.get_dummies(frame['categorical_1'])

How can I specify the full list of possible values, so that dummy columns are created even for values (say 'D' or 'E') that do not occur in the training data?
4 Answers
  • 2021-02-14 17:55

    Isn't this a better answer?

    data = pd.DataFrame({
        "values": [1, 2, 3, 4, 5, 6, 7],
        "categories": ["A", "A", "B", "B", "C", "C", "D"]
    })
    
    possibilities = ["A", "B", "C", "D", "E", "F"]
    
    # Possible values that never occur in the data
    exists = data["categories"].tolist()
    difference = pd.Series([item for item in possibilities if item not in exists])
    
    # Append the missing values so get_dummies creates columns for them
    # (Series.append was removed in pandas 2.0; use pd.concat instead)
    target = pd.concat([data["categories"], difference]).reset_index(drop=True)
    
    dummies = pd.get_dummies(target)
    
    # Drop the padding rows that existed only to force the extra columns
    dummies = dummies.iloc[:len(dummies) - len(difference)]
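    An alternative that avoids appending and then dropping rows (a sketch I am adding, not part of the original answer): cast the column to a pandas Categorical that declares every allowed value up front, and get_dummies will emit a column for each category whether or not it occurs in the data.

```python
import pandas as pd

data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"],
})

possibilities = ["A", "B", "C", "D", "E", "F"]

# Declare the full set of allowed values; unseen categories
# ("E", "F") still get their own dummy columns, all zeros.
cat = pd.Categorical(data["categories"], categories=possibilities)
dummies = pd.get_dummies(cat)

print(list(dummies.columns))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

    No padding rows are created, so there is nothing to drop afterwards.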
    
  • 2021-02-14 18:06

    I encountered the same problem: how to unify the dummy columns between training data and testing data when using get_dummies() in Pandas. I found a solution while exploring the House Price competition on Kaggle: process the training and testing data at the same time. Suppose you have two dataframes df_train and df_test (neither containing the target column).

    all_data = pd.concat([df_train, df_test], axis=0)
    all_data = pd.get_dummies(all_data)
    X_train  = all_data.iloc[:df_train.shape[0]]   # processed training rows
    X_test   = all_data.iloc[-df_test.shape[0]:]   # processed testing rows
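    A runnable sketch of this concat trick with toy frames (df_train and df_test below are illustrative, not from the original post):

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue", "red"]})
df_test  = pd.DataFrame({"color": ["green", "blue"]})  # "green" unseen in training

# Encode train and test together so both get the same dummy columns.
all_data = pd.get_dummies(pd.concat([df_train, df_test], axis=0))

# Positional slicing keeps row order: the first len(df_train) rows are train.
X_train = all_data.iloc[:df_train.shape[0]]
X_test  = all_data.iloc[-df_test.shape[0]:]

print(list(X_train.columns))  # ['color_blue', 'color_green', 'color_red']
```

    Both frames end up with identical columns; the caveat is that you must re-run the whole encoding whenever new test data arrives.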
    

    Hope it helps.

  • 2021-02-14 18:06

    To handle the mismatch between the sets of categorical values in the train and test sets, I used:

        # Build the test dummies using the training columns; any column
        # missing from the test data is filled with zeros.
        length = test_categorical_data.shape[0]  # row count must match the *test* data
        empty_col = np.zeros(length)             # 1-D: one zero per test row
        test_categorical_data_processed = pd.DataFrame()
        for col in train_categorical_data.columns:
            test_categorical_data_processed[col] = test_categorical_data.get(col, empty_col)
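    A more compact variant of the same column-alignment idea (my addition, not from this answer) is DataFrame.reindex, which adds the missing columns filled with 0 and drops the extras in one call:

```python
import pandas as pd

train = pd.DataFrame({"cat": ["A", "B", "C"]})
test  = pd.DataFrame({"cat": ["B", "D"]})  # "D" never seen in training

train_dummies = pd.get_dummies(train["cat"])
test_dummies = pd.get_dummies(test["cat"])

# Align the test columns to the training layout: columns missing from
# the test data become all zeros; columns unseen in training ("D") are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))  # ['A', 'B', 'C']
```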
    
  • 2021-02-14 18:11

    First, if you want pandas to account for more values, simply add them to the list sent to the get_dummies method:

    data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9], 
            'categorical_1':['A', 'B', 'C', 'B', 'B']}
    frame = pd.DataFrame(data)
    dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
    

    since in Python, + on lists performs concatenation, so

    ['A','B','C','B','B'] + ['D','E']
    

    results in

    ['A', 'B', 'C', 'B', 'B', 'D', 'E']
    

    In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.

    From the machine learning perspective, it is quite redundant. The column is categorical, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the features as one-hot ("unary") vectors, which I assume since you create a column for each value, it is enough to represent the 'D' and 'E' values with

    A   B   C
    0   0   0
    

    (I assume that you represent the 'B' value with 0 1 0, 'C' with 0 0 1, etc.)

    because if no such values appeared in the training set, then at test time no model will distinguish between receiving the value 'D' or 'Elephant'.
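    To illustrate the point (a sketch using pd.Categorical, which this answer itself does not use): any value outside the declared training categories encodes to an all-zero row, so 'D' and 'Elephant' look identical to the model.

```python
import pandas as pd

train_values = ["A", "B", "C"]

# Encode test values against only the training categories; anything
# unseen ("D", "Elephant") maps to NaN and thus to an all-zero dummy row.
test = pd.Categorical(["B", "D", "Elephant"], categories=train_values)
dummies = pd.get_dummies(test)

print(dummies.sum(axis=1).tolist())  # [1, 0, 0]
```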

    The only reason for such an action would be the assumption that in the future you will add data containing 'D' values and simply do not want to modify the code; then it is reasonable to do it now, even though it makes training slightly more complex (you add a dimension that, for now, carries no information), but that seems a small problem.

    If you are not going to use a one-hot encoding, but rather want to keep the column as a single categorical feature, then you do not need to create these "dummies" at all; instead use a model that can work with such values directly, such as Naive Bayes, which can be trained with Laplace smoothing to handle values absent from the training data.
