Implementation of sklearn.impute.IterativeImputer

问题

Consider data which contains some nan below:

Column-1    Column-2    Column-3    Column-4    Column-5
0   NaN 15.0    63.0    8.0 40.0
1   60.0    51.0    NaN 54.0    31.0
2   15.0    17.0    55.0    80.0    NaN
3   54.0    43.0    70.0    16.0    73.0
4   94.0    31.0    94.0    29.0    53.0
5   99.0    52.0    77.0    91.0    58.0
6   84.0    19.0    36.0    NaN 97.0
7   41.0    91.0    62.0    67.0    68.0
8   44.0    38.0    27.0    53.0    37.0
9   58.0    NaN 63.0    57.0    28.0
10  66.0    68.0    89.0    36.0    47.0
11  7.0 81.0    5.0 99.0    16.0
12  43.0    55.0    64.0    88.0    NaN
13  8.0 90.0    91.0    44.0    4.0
14  29.0    52.0    94.0    71.0    47.0
15  22.0    21.0    68.0    61.0    38.0
16  76.0    36.0    70.0    99.0    50.0
17  38.0    31.0    66.0    79.0    99.0
18  94.0    22.0    92.0    39.0    58.0

I want to replace nan in the data using sklearn.impute.IterativeImputer. A friend helped me with the code below:

imp = IterativeImputer(missing_values=np.nan, sample_posterior=False, 
                                 max_iter=10, tol=0.001, 
                                 n_nearest_features=4, initial_strategy='median')
imp.fit(data)
imputed_data = pd.DataFrame(data=imp.transform(data), 
                             columns=['Column-1', 'Column-2', 'Column-3', 'Column-4', 'Column-5'],
                             dtype='int')

The imputed_data is:


Column-1    Column-2    Column-3    Column-4    Column-5
0   59  15  63  8   40
1   60  51  66  54  31
2   15  17  55  80  48
3   54  43  70  16  73
4   94  31  94  29  53
5   99  52  77  91  58
6   84  19  36  59  97
7   41  91  62  67  68
8   44  38  27  53  37
9   58  46  63  57  28
10  66  68  89  36  47
11  7   81  5   99  16
12  43  55  64  88  47
13  8   90  91  44  4
14  29  52  94  71  47
15  22  21  68  61  38
16  76  36  70  99  50
17  38  31  66  79  99
18  94  22  92  39  58

From the IterativeImputer documentation, the default estimator is BayesianRidge(). But if I use other estimators such as estimator=ExtraTreesRegressor(n_estimators=10, random_state=0) like in the code below, it returns a warning message. The code:

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0), missing_values=np.nan, sample_posterior=False, 
                                 max_iter=10, tol=0.001, 
                                 n_nearest_features=4, initial_strategy='median')
imp.fit(data)

The message:

C:\Users\...\sklearn\impute\_iterative.py:599: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached. " reached.", ConvergenceWarning).

My question: is this a correct approach or should I do something to fix the warning message?
Thank you.

回答1:

Have you tried to import ExtraTreesRegressor first. It should work fine.

from sklearn.ensemble import ExtraTreesRegressor.

Also check for the version of scikit learn. It should be 0.21.1 and above.

回答2:

They are having the same issue here:

https://github.com/scikit-learn/scikit-learn/issues/14338

来源：https://stackoverflow.com/questions/57154209/implementation-of-sklearn-impute-iterativeimputer

标签

python

dataframe

scikit-learn

missing-data

imputation