Why does sklearn Imputer need to fit?

轻奢々 2021-02-01 16:28

I'm really new to this whole machine learning thing and I'm taking an online course on the subject. In this course, the instructors showed the following piece of code:

<
1 answer
  • 2021-02-01 16:52

    The Imputer fills missing values with a statistic of the data (e.g. the mean or the median). To avoid data leakage during cross-validation, it computes that statistic on the training data during fit, stores it, and then applies it to the test data during transform.

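    import numpy as np
    # Imputer was removed in scikit-learn 0.22; on newer versions use
    # sklearn.impute.SimpleImputer, which also exposes fit, transform and statistics_.
    from sklearn.preprocessing import Imputer
    obj = Imputer(strategy='mean')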
    
    obj.fit([[1, 2, 3], [2, 3, 4]])
    print(obj.statistics_)
    # array([ 1.5,  2.5,  3.5])
    
    X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
    print(X)
    # array([[ 4. ,  2.5,  6. ],
    #        [ 5. ,  6. ,  3.5]])
    

    If you fit and transform the same data, you can do both steps in one call with fit_transform.

    X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
    print(X)
    # array([[ 1. ,  2. ,  4. ],
    #        [ 2. ,  3. ,  4. ]])
    

    Avoiding this leakage matters because the data distribution may differ between the training data and the test data, and you don't want information from the test data to already be present during fit.
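
    To make the leakage concrete, here is a minimal NumPy-only sketch. The toy arrays and the helper fill_missing are hypothetical and are not part of scikit-learn; they only mimic what fit (compute the statistic) and transform (apply it) do. It compares a fill value learned from the training rows alone with one computed after the test rows have been mixed in.

```python
import numpy as np

# Hypothetical toy data: one feature whose distribution shifts between splits.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[10.0], [np.nan], [14.0]])

# "fit" done correctly: the fill value comes from the training data only.
train_mean = np.nanmean(X_train, axis=0)

# Leaky variant: the fill value is computed with the test rows included.
leaky_mean = np.nanmean(np.vstack([X_train, X_test]), axis=0)


def fill_missing(X, fill_values):
    """Return a copy of X with NaNs replaced by the per-column fill value."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X


# "transform": apply a stored statistic to data it was not computed from.
X_test_clean = fill_missing(X_test, train_mean)
X_test_leaky = fill_missing(X_test, leaky_mean)

print(train_mean)  # statistic learned from the training data only
print(leaky_mean)  # statistic contaminated by the test values
```

    In the leaky variant the value used to fill the test set already depends on the test set itself, which is exactly what keeping fit and transform separate prevents.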

    See the scikit-learn documentation for more information about cross-validation.
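
    In practice, one common way to get this behaviour during cross-validation is to put the imputer in a pipeline, so that it is refit on the training part of every fold. The sketch below is an illustration under assumptions, not the course's code: it uses SimpleImputer (the replacement for Imputer in newer scikit-learn releases), synthetic data, and LogisticRegression as an arbitrary estimator.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): 100 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the entries to create missing values.
X[rng.rand(100, 3) < 0.1] = np.nan

# For each fold, the pipeline calls fit on the training part only, so the
# imputer's means never see the rows held out for scoring.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```

    Calling fit_transform on the full X before cross-validating would instead compute the means from all rows, including the ones later used for scoring.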
