I'm trying to collapse rows in a dataframe that contains a column of ID data and a number of columns that each hold a different string. It looks like groupby is the solution, b
You can use groupby with the aggregation ''.join, sum, or max:
# If the blank values are NaN, replace them with '' first
df = df.fillna('')
df = df.groupby('ID').agg(''.join)
print(df)
     apples  pears  oranges
ID
101                 oranges
134  apples  pears
576          pears  oranges
837  apples
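For anyone who wants to run this end to end, here is a minimal sketch. The input frame is an assumption: the question's data is not shown here, so it is reconstructed from the output above, with one fruit per row and blanks stored as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical input, reconstructed from the printed result; blanks are NaN.
df = pd.DataFrame({
    "ID": [101, 134, 134, 576, 576, 837],
    "apples": [np.nan, "apples", np.nan, np.nan, np.nan, "apples"],
    "pears": [np.nan, np.nan, "pears", "pears", np.nan, np.nan],
    "oranges": ["oranges", np.nan, np.nan, np.nan, "oranges", np.nan],
})

# Replace NaN with '' so that ''.join only ever sees strings,
# then concatenate the strings of each column within each ID.
collapsed = df.fillna("").groupby("ID").agg("".join)
print(collapsed)
```

Without the fillna step, ''.join would raise a TypeError on the NaN floats.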
sum and max also work:
df = df.fillna('')
df = df.groupby('ID').sum()
# alternatively, max
# df = df.groupby('ID').max()
print(df)
     apples  pears  oranges
ID
101                 oranges
134  apples  pears
576          pears  oranges
837  apples
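The two give the same result here only because each ID has at most one non-blank string per column. A small sketch of where they differ, using a hypothetical frame in which ID 134 lists apples twice:

```python
import pandas as pd

# Hypothetical frame: ID 134 has "apples" on two rows; blanks are ''.
df = pd.DataFrame({
    "ID": [134, 134, 134],
    "apples": ["apples", "apples", ""],
    "pears": ["", "", "pears"],
})

# sum concatenates every string in the group, repeats included.
summed = df.groupby("ID").sum()

# max keeps the single largest string; '' sorts before any non-empty string.
largest = df.groupby("ID").max()

print(summed)
print(largest)
```

So max is the safer choice when a value can repeat within a group, while sum behaves like ''.join.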
To also remove duplicates per group and per column, add unique:
df = df.groupby('ID').agg(lambda x: ''.join(x.unique()))
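As a quick illustration of what unique changes, here is a sketch comparing the plain join with the deduplicated one. The frame is hypothetical and repeats a value inside one group.

```python
import pandas as pd

# Hypothetical frame: "apples" appears twice for ID 134; blanks are ''.
df = pd.DataFrame({
    "ID": [134, 134, 134],
    "apples": ["apples", "apples", ""],
    "pears": ["", "", "pears"],
})

# Plain join keeps the repeat.
plain = df.groupby("ID").agg("".join)

# unique() drops repeated values within each group and column before joining.
deduped = df.groupby("ID").agg(lambda x: "".join(x.unique()))

print(plain)
print(deduped)
```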
Assuming the blanks are ''.
Option 1: pivot_table
df.pivot_table(['apples', 'pears', 'oranges'], 'ID', aggfunc=''.join)
Option 2: sort each group and take the last row, since '' will be sorted first
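import numpy as np
import pandas as pd

def f(df):
    # Sort each column of the group; '' sorts first, so the last row holds the strings
    return pd.DataFrame(np.sort(df.values, 0)[[-1]], [df.name], df.columns)

df.set_index(
    'ID', append=True
).groupby(level='ID', group_keys=False).apply(f)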
Both yield
     apples  oranges  pears
ID
101          oranges
134  apples           pears
576          oranges  pears
837  apples
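A self-contained sketch of option 1, for anyone who wants to try it. The input frame is an assumption reconstructed from the output above, with blanks already stored as ''.

```python
import pandas as pd

# Hypothetical input: one fruit per row, blanks stored as ''.
df = pd.DataFrame({
    "ID": [101, 134, 134, 576, 576, 837],
    "apples": ["", "apples", "", "", "", "apples"],
    "pears": ["", "", "pears", "pears", "", ""],
    "oranges": ["oranges", "", "", "", "oranges", ""],
})

# Positional arguments are values and index; each value column is
# aggregated per ID by joining its strings.
wide = df.pivot_table(["apples", "pears", "oranges"], "ID", aggfunc="".join)
print(wide)
```

If the blanks are NaN rather than '', call df.fillna('') first, as in the groupby answer.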