Python Pandas: create a new column for each different value of a source column (with boolean output as column values)

雨燕双飞 submitted on 2019-12-02 05:59:14

Question


I am trying to split a source column of a dataframe into several columns based on its content, and then fill these newly generated columns with a boolean 1 or 0 in the following way:

Original dataframe:

ID   source_column
A    value 1
B    NaN
C    value 2
D    value 3
E    value 2

Generating the following output:

ID   source_column    value 1    value 2    value 3
A    value 1          1          0          0
B    NaN              0          0          0
C    value 2          0          1          0
D    value 3          0          0          1
E    value 2          0          1          0

I thought about manually creating each column and then filling it with a 1 or a 0 using a separate function and .apply for each column, but this is highly inefficient.

Is there a quick and efficient way to do this?


Answer 1:


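You can try the following (note that it replaces source_column with indicator columns named source_column_<value>):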

df = pd.get_dummies(df, columns=['source_column'])

or, if you prefer sklearn:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# fit_transform expects 2-D input, so select the column as a one-column DataFrame
matrix = enc.fit_transform(df[['source_column']])
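
For the get_dummies route, here is a minimal sketch that reproduces the layout asked for in the question: source_column is kept and the new columns carry the bare value names. The sample frame is rebuilt by hand from the question's data, and the names dummies and result are illustrative only. Passing dtype=int is an assumption about the desired output, since recent pandas versions return True/False by default.

```python
import numpy as np
import pandas as pd

# Example data rebuilt from the question.
df = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# One indicator column per distinct non-null value; NaN rows get all zeros.
# dtype=int forces 1/0 instead of True/False.
dummies = pd.get_dummies(df["source_column"], dtype=int)

# join aligns on the index, so the original columns are kept untouched.
result = df.join(dummies)
print(result)
```

Because the indicator frame is joined rather than assigned through a separately built list of names, the column labels always come from get_dummies itself.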



Answer 2:


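You can use the pandas function get_dummies and add the result to df, as shown below. Taking the column names from the result itself keeps names and values aligned, because get_dummies returns its columns in sorted order, which need not match the order of unique().

In [1]: dummies = pd.get_dummies(df['source_column'], dtype=int)

In [2]: df[dummies.columns] = dummies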

In [3]: df
Out[3]: 
  ID source_column  value 1  value 2  value 3
0  A       value 1        1        0        0
1  B          NaN         0        0        0
2  C       value 2        0        1        0
3  D       value 3        0        0        1
4  E       value 2        0        1        0



Answer 3:


So there is this possibility (a little bit hacky).

Reading the DataFrame from your example data:

In [4]: df = pd.read_clipboard().drop("ID", axis=1)

In [5]: df
Out[5]:
   source_column
A            1.0
B            NaN
C            2.0
D            3.0
E            2.0

After that, add a new column with df['foo'] = 1.

Then work with unstacking:

In [22]: df.reset_index().set_index(['index', 'source_column']).unstack().fillna(0).rename_axis([None]).astype(int)
Out[22]:
              foo
source_column NaN 1.0 2.0 3.0
A               0   1   0   0
B               1   0   0   0
C               0   0   1   0
D               0   0   0   1
E               0   0   1   0

You then of course have to rename your columns and drop the NaN column, but that should fulfill your needs as a first pass.

EDIT:

Another approach, which suppresses the NaN column, is to use groupby + value_counts (kind of hacky too):

In [30]: df.reset_index().groupby("index").source_column.value_counts().unstack().fillna(0).astype(int).rename_axis([None])
Out[30]:
source_column  1.0  2.0  3.0
A                1    0    0
C                0    1    0
D                0    0    1
E                0    1    0

This is the same idea (unstacking), but NaN values are left out by default. You of course have to merge it back onto the original dataframe if you want to keep the rows with NaN values. Overall, both approaches work fine; you can choose the one that best fulfills your needs.
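
The merge-back step mentioned above could be sketched as follows. This is only one possible way to do it, assuming the same numeric example frame as in this answer: reindex restores the rows that value_counts skipped, and join attaches the indicators to the original data. The names counts, indicators, and result are illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the frame used in this answer: labels A..E as the index.
df = pd.DataFrame(
    {"source_column": [1.0, np.nan, 2.0, 3.0, 2.0]},
    index=["A", "B", "C", "D", "E"],
)

# Count each value per row label. value_counts skips NaN by default,
# so row "B" does not appear in the unstacked table.
counts = (
    df.groupby(level=0)["source_column"]
    .value_counts()
    .unstack(fill_value=0)
)

# Bring back the skipped rows as all-zero rows, then attach to the original.
indicators = counts.reindex(df.index, fill_value=0).astype(int)
result = df.join(indicators)
print(result)
```

Using reindex rather than a merge keeps the original row order and avoids a second pass to fill the gaps.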




Answer 4:


pd.concat([df, pd.crosstab(df.index, df.source_column)], axis=1).fillna(0)

Out[1028]: 
  ID source_column  value1  value2  value3
0  A        value1     1.0     0.0     0.0
1  B             0     0.0     0.0     0.0
2  C        value2     0.0     1.0     0.0
3  D        value3     0.0     0.0     1.0
4  E        value2     0.0     1.0     0.0
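
As the output shows, fillna(0) on the whole frame also overwrites the NaN in source_column and leaves the indicators as floats. If that is not wanted, a hedged variant is to reindex the crosstab before concatenating, so only the indicator columns are filled. The sample data and the names table and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "source_column": ["value1", np.nan, "value2", "value3", "value2"],
})

# crosstab counts each (row index, value) pair; the NaN row is dropped.
table = pd.crosstab(df.index, df["source_column"])

# Restore the dropped row as zeros before concatenating, so the
# indicator columns stay integer and source_column keeps its NaN.
table = table.reindex(df.index, fill_value=0)
result = pd.concat([df, table], axis=1)
print(result)
```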


Source: https://stackoverflow.com/questions/48646739/python-pandas-create-a-new-column-for-each-different-value-of-a-source-column
