问题
I want to turn my dataframe with non-distinct values underneath each column header into a dataframe with distinct values underneath each column header with next to it their occurrence in their particular column. An example:
My initial dataframe is visible underneath:
A B C D
0 CEN T2 56
2 DECEN T2 45
3 ONBEK T2 84
NaN CEN T1 59
3 NaN T1 87
NaN NaN T2 NaN
0 NaN NaN 98
NaN CEN NaN 23
NaN CEN T1 65
where A, B, C and D are the column headers with each 9 values underneath it (blanks included).
My preferred output dataframe should look like: (first a column of unique values for each column in the original dataframe and next to it their occurrence in that particular column)
A B C D A B C D
0 CEN T2 56 2 4 4 1
2 DECEN T1 45 1 1 3 1
3 ONBEK NaN 84 2 1 NaN 1
Nan NaN NaN 59 NaN NaN NaN 1
NaN NaN NaN 87 NaN NaN NaN 1
NaN NaN NaN 98 NaN NaN NaN 1
NaN NaN NaN 23 NaN NaN NaN 1
NaN NaN NaN 65 NaN NaN NaN 1
where A, B, C and D are the column headers with underneath them first the distinct values for each column from the original .csv-file and next to it the occurence of each element in their particular column.
Anybody ideas?
The code below is used to get the unique values out of each column into a new dataframe. I tried to do something with .value_counts to get the occurrence in each column but there I failed to get it into one dataframe again with the unique values..
df
new_df=pd.concat([pd.Series(df[i].unique()) for i in df.columns], axis=1)
new_df.columns=df.columns
new_df
回答1:
The difficult part is keeping values of columns in each row aligned. To do this, you need to construct a new dataframe from unique
, and pd.concat
on with value_counts
map to each column of this new dataframe.
new_df = (pd.DataFrame([df[c].unique() for c in df], index=df.columns).T
.dropna(how='all'))
df_final = pd.concat([new_df, *[new_df[c].map(df[c].value_counts()).rename(f'{c}_Count')
for c in df]], axis=1).reset_index(drop=True)
Out[1580]:
A B C D A_Count B_Count C_Count D_Count
0 0 CEN T2 56 2.0 4.0 4.0 1
1 2 DECEN T1 45 1.0 1.0 3.0 1
2 3 ONBEK NaN 84 2.0 1.0 NaN 1
3 NaN NaN NaN 59 NaN NaN NaN 1
4 NaN NaN NaN 87 NaN NaN NaN 1
5 NaN NaN NaN 98 NaN NaN NaN 1
6 NaN NaN NaN 23 NaN NaN NaN 1
7 NaN NaN NaN 65 NaN NaN NaN 1
If you only need to keep alignment between each pair of column and its count such as A
- A_Count
, B
- B_Count
..., it simply just use value_counts
with reset_index
some commands to change axis names
cols = df.columns.tolist() + (df.columns + '_Count').tolist()
new_df = pd.concat([df[col].value_counts(sort=False).rename_axis(col).reset_index(name=f'{col}_Count')
for col in df], axis=1).reindex(new_cols, axis=1)
Out[1501]:
A B C D A_Count B_Count C_Count D_Count
0 0.0 ONBEK T2 56.0 2.0 1.0 4.0 1
1 2.0 CEN T1 45.0 1.0 4.0 3.0 1
2 3.0 DECEN NaN 84.0 2.0 1.0 NaN 1
3 NaN NaN NaN 59.0 NaN NaN NaN 1
4 NaN NaN NaN 87.0 NaN NaN NaN 1
5 NaN NaN NaN 98.0 NaN NaN NaN 1
6 NaN NaN NaN 23.0 NaN NaN NaN 1
7 NaN NaN NaN 65.0 NaN NaN NaN 1
来源:https://stackoverflow.com/questions/60230730/get-unique-values-and-their-occurrence-out-of-one-dataframe-into-a-new-dataframe