I have a pandas dataframe in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean of the column. According to the sklearn documentation, the parameter missing_values should help me with this:
missing_values : integer or “NaN”, optional (default=”NaN”) The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
In my understanding, this means that if I write
df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
the imputer replaces every na value in the dataframe with the mean of its column. Instead, I get an error:
ValueError: could not convert string to float: na
What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?
Thank you!
Since you say you want to replace these 'na' strings with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, and so reads the column with dtype object instead of some flavor of float.
Case in point, consider the following .csv
With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:
df = pd.read_csv('test.csv', na_values='na')
The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
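To make the difference concrete, here is a minimal sketch. The CSV contents are made up for illustration (the original test.csv is not shown), and io.StringIO stands in for the file on disk; only the column names col1 and col2 come from the answer.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: col1 marks a missing value with 'na'.
csv_text = "col1,col2\n1.5,10.0\nna,20.0\n3.5,30.0\n"

# Naive import: 'na' is not one of pandas' default missing-value markers,
# so col1 is read as strings instead of floats.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.dtypes)

# Telling read_csv that 'na' means "missing" gives a float64 column with NaN.
fixed = pd.read_csv(io.StringIO(csv_text), na_values="na")
print(fixed.dtypes)
print(fixed)
```

Once col1 is float64, the missing entry is a real NaN, which is what an imputer expects to find.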
Here is the error I was receiving
IndexError: in the future, 0-d boolean arrays will be interpreted as a valid boolean index
In my case the issue was with the "median" strategy; changing it to mean or most_frequent worked.
First import pandas, then read your_file_name.csv. iloc (pandas.DataFrame.iloc) is purely integer-based indexing for selection by position. The format is iloc[row index, column index]. In the code below a, b, c and d are integers, and any of them can also be left empty.
import pandas as pd
dataSet = pd.read_csv('your_file_name.csv')
X = dataSet.iloc[ a:b , c:d].values
Without .values, X is a DataFrame rather than a NumPy array, and the slice assignment used with the imputer below will not work.
After importing Imputer, define it with these parameters: missing_values is the placeholder for the missing values in your data that you want to replace, and strategy="mean" replaces them with the mean (two more strategies are available, median and most_frequent, but the default is mean). Then set axis (0 for columns, 1 for rows). The other parameters are copy and verbose; you can read more about them in the sklearn documentation.
from sklearn.preprocessing import Imputer
i = Imputer(missing_values="NaN", strategy="mean", axis=0)
Fit the imputer on the selected data, then transform it with the transform method. After the assignment, X keeps its own dtype, which is object when the dataset mixes numeric and non-numeric columns.
i = i.fit(X[a:b, c:d])
X[a:b, c:d ] = i.transform(X[a:b,c:d])
Remember that the selected columns should contain only float or integer values; otherwise you may get the error "could not convert string to float".
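Note that sklearn.preprocessing.Imputer only exists in older scikit-learn releases; as far as I know it was removed in version 0.22 in favour of sklearn.impute.SimpleImputer, which has no axis parameter and always works column by column. A rough equivalent of the steps above with the newer class might look like this. The DataFrame and its column names are invented for the example, and np.nan is used as the missing-value marker instead of the string "NaN".

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for your_file_name.csv.
data_set = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "weight": [10.0, 20.0, np.nan],
})

# All rows, columns 0 and 1, as a NumPy array.
X = data_set.iloc[:, 0:2].values

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the missing height becomes 2.0 (the mean of 1.0 and 3.0) and the missing weight becomes 15.0 (the mean of 10.0 and 20.0).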
There are several things you need to pay attention to here.
Make sure you are not imputing on columns of type "object" or on categorical variables. You can have a look at your data like this:
df = pd.read_csv(filename)
The last column of the output should be the type.
Let's see an example:
df = pd.DataFrame({'A' : [1, 2, 2, 2, 'NaN', 3, 4, 5, 6], 'B' : [3, 3, 'NaN', 3, 3, 4, 3, 3, 3]})
     A    B
0    1    3
1    2    3
2    2  NaN
3    2    3
4  NaN    3
Now let's have a look at the types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
0 9 non-null float64
1 9 non-null float64
dtypes: float64(2)
memory usage: 224.0 bytes
Now imputing:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df))
0 1
0 1.0 3.0
1 2.0 3.0
2 2.0 3.0
3 2.0 3.0
4 2.0 3.0
Now this is all well and good, but it cannot be done on categorical columns (type object / string).
One way to handle it is to convert the categorical features to numeric, something like this:
df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN'], 'B' : [4, 4, 'NaN', 2]})
         A    B
0      ios    4
1  android    4
2      web  NaN
3      NaN    2
And info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
We know for sure that B is numerical so, let's do this:
df_with_cat['B'] = df_with_cat['B'].astype(float)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 3 non-null float64
dtypes: float64(1), object(1)
memory usage: 144.0+ bytes
If we used the very same imputer from above on this frame, we'd get an error (you can try it out).
Now let's transform the 'A' categories to numbers:
data_dum = pd.get_dummies(df_with_cat, columns=['A'], drop_first=True)
B A_android A_ios A_web
0 4 0 1 0
1 4 1 0 0
2 NaN 0 0 1
3 2 0 0 0
Now we can run the very same Imputer from above on our data frame
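Put together, the whole walk-through could be sketched as follows. This is only an illustration under a few assumptions: it uses sklearn.impute.SimpleImputer (the replacement for Imputer in newer scikit-learn versions), it passes dtype=float to get_dummies so that every column handed to the imputer is numeric, and it restores the column names by hand because fit_transform returns a plain array.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Same frame as above: 'NaN' strings mark the missing entries.
df_with_cat = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
})

# B is numerical, so convert it; the string 'NaN' becomes a real NaN.
df_with_cat["B"] = df_with_cat["B"].astype(float)

# Turn the categories in A into 0/1 indicator columns.
data_dum = pd.get_dummies(df_with_cat, columns=["A"], drop_first=True, dtype=float)

# Fill the remaining NaN in B with the most frequent value of that column.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(data_dum), columns=data_dum.columns)
print(df_imputed)
```

The missing value in B is filled with 4.0, the most frequent value in that column, while the indicator columns pass through unchanged.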