Question
I have a dataframe where one column is short_names. Each entry in short_names consists of names of 2-5 letters, e.g. BG, OP, LE, WEL, LC. Each row can have any number of names.
I am trying to use MultiLabelBinarizer to convert the names into individual columns, so that each row has a 1 in the column of every name it contains:
one_hot = MultiLabelBinarizer()
one_hot.fit_transform(df['short_name'])
one_hot.classes_
Because there is a '-' in one of the rows, which results in the error TypeError: 'float' object is not iterable, I have used:
df['short_names']= df['short_names'].astype(str)
The issue now is that the classes output consists of single letters instead of the short names, i.e. A, B, C instead of BG, OP.
Answer 1:
I think you need dropna to remove the missing values, together with str.split (if necessary) to turn each string into a list of names:
import numpy as np
import pandas as pd

df = pd.Series({0: np.nan, 1: 'CE', 2: 'NPP', 4: 'SE, CB, CBN, OOM, BCI', 5: 'RCS'}).to_frame('short_name')
print (df)
              short_name
0                    NaN
1                     CE
2                    NPP
4  SE, CB, CBN, OOM, BCI
5                    RCS
from sklearn.preprocessing import MultiLabelBinarizer
one_hot = MultiLabelBinarizer()
a = one_hot.fit_transform(df['short_name'].dropna().str.split(', '))
print (a)
[[0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0]
[1 1 1 0 0 1 0 1]
[0 0 0 0 0 0 1 0]]
print(one_hot.classes_ )
['BCI' 'CB' 'CBN' 'CE' 'NPP' 'OOM' 'RCS' 'SE']
If you want the output as a DataFrame:
df = pd.DataFrame(a, columns=one_hot.classes_ )
print (df)
   BCI  CB  CBN  CE  NPP  OOM  RCS  SE
0    0   0    0   1    0    0    0   0
1    0   0    0   0    1    0    0   0
2    1   1    1   0    0    1    0   1
3    0   0    0   0    0    0    1   0
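As for why the astype(str) attempt in the question produced single letters: MultiLabelBinarizer treats every sample as an iterable of labels, and a Python string iterates character by character. The sketch below illustrates this on a made-up two-row sample (the data and the variable names are hypothetical, not taken from the question):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data, not from the original question.
as_strings = ['BG', 'OP, LE']
as_lists = [s.split(', ') for s in as_strings]

# A string is iterated character by character, so every character
# (including the comma and the space) becomes its own class.
mlb_chars = MultiLabelBinarizer()
mlb_chars.fit(as_strings)
char_classes = list(mlb_chars.classes_)

# A list is iterated element by element, so the classes are whole names.
mlb_names = MultiLabelBinarizer()
mlb_names.fit(as_lists)
name_classes = list(mlb_names.classes_)

print(char_classes)
print(name_classes)
```

This is why the split step matters: each row has to be a list (or set) of names rather than a plain string.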
Another solution is to replace the missing values with fillna:
from sklearn.preprocessing import MultiLabelBinarizer
one_hot = MultiLabelBinarizer()
a = one_hot.fit_transform(df['short_name'].fillna('missing').str.split(', '))
print (a)
[[0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0]
[1 1 1 0 0 1 0 1 0]
[0 0 0 0 0 0 1 0 0]]
print(one_hot.classes_ )
['BCI' 'CB' 'CBN' 'CE' 'NPP' 'OOM' 'RCS' 'SE' 'missing']
df = pd.DataFrame(a, columns=one_hot.classes_ )
print (df)
   BCI  CB  CBN  CE  NPP  OOM  RCS  SE  missing
0    0   0    0   0    0    0    0   0        1
1    0   0    0   1    0    0    0   0        0
2    0   0    0   0    1    0    0   0        0
3    1   1    1   0    0    1    0   1        0
4    0   0    0   0    0    0    1   0        0
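One thing to keep in mind with the dropna variant is that the resulting DataFrame gets a fresh 0..n-1 index, so its rows no longer line up with the original row labels. If the encoded columns should sit next to the original column, one possible approach (a sketch only; the names `names`, `encoded` and `result` are illustrative) is to reuse the index of the split Series and reindex back to the original frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same sample data as in the answer above.
df = pd.Series({0: np.nan, 1: 'CE', 2: 'NPP',
                4: 'SE, CB, CBN, OOM, BCI', 5: 'RCS'}).to_frame('short_name')

# Drop missing rows and split each remaining string into a list of names.
names = df['short_name'].dropna().str.split(', ')

one_hot = MultiLabelBinarizer()
encoded = pd.DataFrame(one_hot.fit_transform(names),
                       columns=one_hot.classes_,
                       index=names.index)  # keep the original row labels

# Restore the rows that were dropped; they get 0 in every class column.
encoded = encoded.reindex(df.index, fill_value=0)

# Put the indicator columns next to the original column.
result = df.join(encoded)
print(result)
```

Compared with fillna('missing'), this keeps the missing rows as all-zero rows instead of introducing an extra 'missing' class; which of the two is preferable depends on how the columns are used afterwards.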
Source: https://stackoverflow.com/questions/51335535/multilabelbinarizer-output-classes-in-letters-instead-of-categories