Python - SkLearn Imputer usage

Question

I have a pandas dataframe in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the column mean. According to the sklearn documentation, the parameter missing_values should help me with this:

missing_values : integer or “NaN”, optional (default=”NaN”) The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.

In my understanding, this means, that if I write

df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
imp.fit_transform(df)

that would mean the imputer replaces every na value in the dataframe with the mean of its column. However, I get an error instead:

ValueError: could not convert string to float: na

What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?

Thank you!

Answer 1:

Since you say you want to replace these 'na' values with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, and so reads the column with dtype object instead of some flavor of float.

Case in point, consider the following .csv file:

 test.csv

 col1,col2
 1.0,1.0
 2.0,2.0
 3.0,3.0
 na,4.0
 5.0,5.0

With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
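To see the dtype problem without creating a file, here is a minimal sketch that reads the same contents from memory. The variable names csv_text and naive are made up for this illustration, and the exact name of the text dtype may differ between pandas versions.

```python
import io

import pandas as pd

# Same contents as test.csv above, kept in memory so the snippet is self-contained.
csv_text = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive import: lowercase 'na' is not one of pandas' default missing-value markers.
naive = pd.read_csv(io.StringIO(csv_text))

print(naive.dtypes)            # col1 is read as text, col2 as float64
print(naive["col1"].tolist())  # every entry of col1 is a string, including 'na'
```

Because col1 holds strings, any numeric operation on it (such as taking the mean) fails until the column is converted.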

The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:

df = pd.read_csv('test.csv', na_values='na')

The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
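As a rough end-to-end sketch of this fix: the Imputer class used in the question was removed from scikit-learn in version 0.22, so the example below assumes a newer version and uses its replacement, SimpleImputer from sklearn.impute. The names csv_text and filled are only for illustration.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same contents as test.csv above, kept in memory so the snippet is self-contained.
csv_text = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Tell pandas that the string 'na' marks a missing value.
df = pd.read_csv(io.StringIO(csv_text), na_values="na")

# Both columns are now float64, with NaN where 'na' used to be.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so wrap it to keep the column names.
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

The missing entry in col1 is replaced by the mean of the remaining values 1.0, 2.0, 3.0 and 5.0.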

Answer 2:

Here is the error I was receiving

IndexError: in the future, 0-d boolean arrays will be interpreted as a valid boolean index

In my case the problem was with the "median" strategy; changing it to mean or most_frequent worked.

Answer 3:

First import pandas and read your_file_name.csv. iloc (pandas.DataFrame.iloc) is purely integer-based indexing for selection by position. The format is iloc[row index, column index]; in the example below a, b, c and d are integers, and any of them can be left empty.

import pandas as pd
dataSet = pd.read_csv('your_file_name.csv')
X = dataSet.iloc[a:b, c:d].values

Without .values you get a DataFrame rather than a NumPy array, and the array-style slicing used with the imputer below (X[a:b, c:d]) will not work.
After importing Imputer, define its parameters: missing_values is the placeholder in your data that you want to replace, and strategy is "mean" by default (the other two strategies are "median" and "most_frequent"). Then set axis (0 for columns, 1 for rows). The remaining parameters are copy and verbose; you can read more about them in the scikit-learn documentation.

from sklearn.preprocessing import Imputer
i = Imputer(missing_values="NaN", strategy="mean", axis=0)

Fit the imputer to your data, then apply it with the transform method, which returns a NumPy array.

i = i.fit(X[a:b, c:d])
X[a:b, c:d] = i.transform(X[a:b, c:d])

Remember that the selected columns should contain only float or integer values; otherwise you may get an error such as "could not convert string to float".
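The steps in this answer can be sketched with concrete numbers in place of a, b, c and d. This is only an illustration under a few assumptions: the small data_set frame and its column names are invented stand-ins for your_file_name.csv, and SimpleImputer is used because Imputer no longer exists in scikit-learn 0.22 and later (SimpleImputer always works column-wise, so there is no axis parameter).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for your_file_name.csv.
data_set = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height": [1.0, np.nan, 3.0, 4.0],
    "weight": [10.0, 20.0, np.nan, 60.0],
})

# Rows 0..3 and columns 1..2: only the numeric columns are selected.
X = data_set.iloc[0:4, 1:3].to_numpy()

# Learn the column means, then fill the gaps with them.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

# Write the imputed values back into the original frame.
data_set.iloc[0:4, 1:3] = X_filled
print(data_set)
```

Leaving the text column "name" out of the slice is what avoids the string-to-float conversion error mentioned above.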

Answer 4:

There are several things you need to pay attention to here.

Make sure you are not imputing on columns of type "object" or on categorical variables. You can inspect your data like this:

df = pd.read_csv(filename)

df.info()

The last column of the output shows the type of each column.

Let's see an example:

import numpy as np
df = pd.DataFrame({'A' : [1, 2, 2, 2, np.nan, 3, 4, 5, 6], 'B' : [3, 3, np.nan, 3, 3, 4, 3, 3, 3]})

output:

df.head()


     A    B
-----------
0  1.0  3.0
1  2.0  3.0
2  2.0  NaN
3  2.0  3.0
4  NaN  3.0

Now let's have a look at the types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
A    8 non-null float64
B    8 non-null float64
dtypes: float64(2)
memory usage: 224.0 bytes

Now imputing:

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df))
df_imputed.head()


    0   1
-----------
0   1.0 3.0
1   2.0 3.0
2   2.0 3.0
3   2.0 3.0
4   2.0 3.0
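For readers on scikit-learn 0.22 or later, where Imputer is no longer available, the same step might look like the sketch below. It assumes SimpleImputer from sklearn.impute as the replacement and real np.nan values as the missing-value markers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "A": [1, 2, 2, 2, np.nan, 3, 4, 5, 6],
    "B": [3, 3, np.nan, 3, 3, 4, 3, 3, 3],
})

# Replace each NaN with the most frequent value of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Passing columns= keeps the original names instead of 0 and 1.
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed.head())
```

The most frequent value is 2 in column A and 3 in column B, so those are the values that fill the gaps.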

This is all well and good, but it cannot be done on categorical columns (type object / string).

One way to handle this is to convert the categorical features to numeric ones, something like this:

df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN'], 'B' : [4, 4, 'NaN', 2]})
df_with_cat.head()


      A     B
-------------
0   ios     4
1   android 4
2   web     NaN
3   NaN     2

And the info:

df_with_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A    4 non-null object
B    4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes

We know for sure that B is numerical, so let's convert it:

df_with_cat['B'] = df_with_cat['B'].astype(float)
df_with_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A    4 non-null object
B    3 non-null float64
dtypes: float64(1), object(1)
memory usage: 144.0+ bytes
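The conversion above works because float() can parse the string 'NaN'. A marker like the 'na' from the original question cannot be parsed that way, so a possible alternative is pd.to_numeric with errors="coerce". This is a sketch with invented sample values, not part of the original answer.

```python
import pandas as pd

# 'na' cannot be parsed by float(), so astype(float) would raise here.
raw = pd.Series([4, 4, "na", 2], name="B")

# errors="coerce" turns every unparseable entry into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# With a proper float column, a pandas-only mean fill is a one-liner.
filled = numeric.fillna(numeric.mean())
print(filled)
```

This also covers the "should I just use a lambda" part of the question: once the column is numeric, fillna with the column mean is enough for simple cases.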

If we used the very same imputer from above, we would get an error (you can try it out).

Now let's transform the 'A' categories to numbers:

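CATEGORICAL_FEATURES = [
    'A',
]
# drop_first=True drops the first category in sorted order, here the string 'NaN'.
# dtype=int keeps the 0/1 output shown below; newer pandas defaults to booleans.
data_dum = pd.get_dummies(df_with_cat, columns=CATEGORICAL_FEATURES, drop_first=True, dtype=int)
data_dum.head()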

    B   A_android   A_ios   A_web
---------------------------------
0   4       0         1       0
1   4       1         0       0
2   NaN     0         0       1
3   2       0         0       0

Now we can run the very same Imputer from above on our data frame
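Putting the last steps together, a self-contained sketch could look like this. It again assumes SimpleImputer in place of the removed Imputer class, and it builds column B with np.nan directly rather than converting it from strings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_with_cat = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],  # 'NaN' here is an ordinary string
    "B": [4, 4, np.nan, 2],
})

# One-hot encode A; the first category in sorted order ('NaN') is dropped.
data_dum = pd.get_dummies(df_with_cat, columns=["A"], drop_first=True, dtype=int)

# Every column is numeric now, so the imputer can run on the whole frame.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data_imputed = pd.DataFrame(imputer.fit_transform(data_dum), columns=data_dum.columns)
print(data_imputed)
```

The missing value in B is filled with 4, the most frequent value in that column, while the dummy columns pass through unchanged.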

Source: https://stackoverflow.com/questions/38150330/python-sklearn-imputer-usage

Tags

python

scikit-learn

imputation