Question
I have a pandas dataframe in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean of the column. According to the sklearn documentation, the parameter missing_values should help me with this:
missing_values : integer or “NaN”, optional (default=”NaN”) The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
As I understand it, this means that if I write
df = pd.read_csv(filename)
imp = Imputer(missing_values='na')
imp.fit_transform(df)
then the imputer should replace anything in the dataframe that has the value na with the mean of its column. Instead, I get an error:
ValueError: could not convert string to float: na
What am I misinterpreting? Is this not how the imputer should work? How can I replace the na strings with the mean, then? Should I just use a lambda for it?
Thank you!
Answer 1:
Since you say you want to replace these 'na' strings with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, so it reads the column with dtype object instead of some flavor of float.
Case in point, consider the following .csv file, test.csv:
col1,col2
1.0,1.0
2.0,2.0
3.0,3.0
na,4.0
5.0,5.0
With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:
df = pd.read_csv('test.csv', na_values='na')
The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
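As a minimal sketch of the whole round trip, the snippet below reads the same test.csv content from an in-memory string (an assumption, to keep the example self-contained) and then imputes the column mean. It assumes a recent scikit-learn, where Imputer has been replaced by SimpleImputer in sklearn.impute; the variable names are illustrative only.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer

# Same content as test.csv above, held in memory for the example.
CSV_TEXT = """col1,col2
1.0,1.0
2.0,2.0
3.0,3.0
na,4.0
5.0,5.0
"""

# Naive import: the string 'na' keeps col1 from being parsed as float.
naive = pd.read_csv(io.StringIO(CSV_TEXT))

# Telling pandas that 'na' marks a missing value gives two float64 columns.
parsed = pd.read_csv(io.StringIO(CSV_TEXT), na_values="na")

# SimpleImputer treats np.nan as missing by default and works column-wise.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(parsed), columns=parsed.columns)

print(naive.dtypes)
print(parsed.dtypes)
print(imputed)
```

The missing entry in col1 is filled with the mean of the four remaining values in that column.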
Answer 2:
Here is the error I was receiving:
IndexError: in the future, 0-d boolean arrays will be interpreted as a valid boolean index
In my case the issue was with the "median" strategy; changing it to mean or most_frequent worked.
Answer 3:
First import pandas, then read your_file_name.csv. iloc (pandas.DataFrame.iloc) is purely integer-based indexing for selection by position. The format is iloc[row index, column index]. Below, a, b, c and d are integers, and any of them can be left empty.
import pandas as pd
dataSet = pd.read_csv('your_file_name.csv')
X = dataSet.iloc[ a:b , c:d].values
If you leave out .values, you get a DataFrame rather than a NumPy array, and the slice assignment used below will not work on it.
After importing Imputer, define it with its parameters. missing_values is the placeholder in your data that you want to replace, and strategy="mean" is the default (the two other strategies are median and most_frequent). Then set axis (0 for columns, 1 for rows). The other parameters are copy and verbose; see the scikit-learn documentation for details.
from sklearn.preprocessing import Imputer
i = Imputer(missing_values="NaN", strategy="mean", axis=0)
Fit the imputer on your data and then transform the data with the transform method. This returns a NumPy array.
i = i.fit(X[a:b, c:d])
X[a:b, c:d ] = i.transform(X[a:b,c:d])
Remember that the selected columns should contain only float or integer values; otherwise you may get an error such as "could not convert string to float".
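The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the small DataFrame stands in for your_file_name.csv, the column names are made up, and SimpleImputer from sklearn.impute is used because recent scikit-learn releases no longer ship Imputer. SimpleImputer has no axis parameter and always imputes column-wise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for your_file_name.csv.
data_set = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height": [1.5, np.nan, 1.8, 1.7],
    "weight": [60.0, 70.0, np.nan, 80.0],
})

# All rows, columns at positions 1 and 2 (the numeric ones), as a NumPy array.
X = data_set.iloc[:, 1:3].values

# Replace np.nan entries with the mean of their column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Selecting only the numeric columns with iloc is what keeps the string column out of the imputer.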
Answer 4:
There are several things you need to pay attention to here.
Make sure you are not imputing on columns of type "object" or on categorical variables. You can take a look at your data like this:
df = pd.read_csv(filename)
df.info(null_counts=True)  # info() prints its report itself; pandas 1.2+ names this argument show_counts
The last column of the output is the dtype.
Let's see an example:
df = pd.DataFrame({'A' : [1, 2, 2, 2, 'NaN', 3, 4, 5, 6], 'B' : [3, 3, 'NaN', 3, 3, 4, 3, 3, 3]})
output:
df.head()
A B
---------
0 1 3
1 2 3
2 2 NaN
3 2 3
4 NaN 3
Now let's have a look at the types:
df.info(null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
A 9 non-null object
B 9 non-null object
dtypes: object(2)
memory usage: 224.0+ bytes
Now imputing:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df))
df_imputed.head()
0 1
-----------
0 1.0 3.0
1 2.0 3.0
2 2.0 3.0
3 2.0 3.0
4 2.0 3.0
This is all well and good, but it cannot be done on categorical columns (type object / string).
One way to handle this is to convert the categorical features to numeric ones, something like this:
df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN'], 'B' : [4, 4, 'NaN', 2]})
df_with_cat.head()
A B
-------------
0 ios 4
1 android 4
2 web NaN
3 NaN 2
And the info:
df_with_cat.info(null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
We know for sure that B is numerical, so let's do this:
df_with_cat['B'] = df_with_cat['B'].astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
df_with_cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 3 non-null float64
dtypes: float64(1), object(1)
memory usage: 144.0+ bytes
If we used the very same imputer from above, we would get an error (you can try it out).
Now let's transform the 'A' categories to numbers:
CATEGORICAL_FEATURES = ['A']
data_dum = pd.get_dummies(df_with_cat, columns=CATEGORICAL_FEATURES, drop_first=True)
data_dum.head()
B A_android A_ios A_web
---------------------------------
0 4 0 1 0
1 4 1 0 0
2 NaN 0 0 1
3 2 0 0 0
Now we can run the very same Imputer from above on our data frame
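Put together, the categorical workflow might look like the sketch below. It assumes a recent scikit-learn, so SimpleImputer from sklearn.impute stands in for the Imputer used above, and it passes dtype=float to get_dummies so that every column is numeric; both choices are assumptions rather than part of the original answer.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df_with_cat = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
})

# B is numeric apart from the 'NaN' string, so cast it to float.
df_with_cat["B"] = df_with_cat["B"].astype(float)

# One-hot encode A. The string 'NaN' in A is treated as an ordinary
# category here, and drop_first removes the first dummy column.
data_dum = pd.get_dummies(df_with_cat, columns=["A"], drop_first=True, dtype=float)

# Fill the remaining missing value in B with the most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(data_dum), columns=data_dum.columns)

print(df_imputed)
```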
Source: https://stackoverflow.com/questions/38150330/python-sklearn-imputer-usage