I have this pandas dataframe from a query:
| name   | event   |
----------------------------
| name_1 | event_1 |
| name_1 | event_2 |
| name_2 | event_1 |
Some ways of doing it
1)
In [366]: pd.crosstab(df.name, df.event)
Out[366]:
event event_1 event_2
name
name_1 1 1
name_2 1 0
2)
In [367]: df.groupby(['name', 'event']).size().unstack(fill_value=0)
Out[367]:
event event_1 event_2
name
name_1 1 1
name_2 1 0
3)
In [368]: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
Out[368]:
event event_1 event_2
name
name_1 1 1
name_2 1 0
4)
In [369]: df.assign(v=1).pivot(index='name', columns='event', values='v').fillna(0)
Out[369]:
event event_1 event_2
name
name_1 1.0 1.0
name_2 1.0 0.0
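One difference between the four approaches is worth spelling out: 1) to 3) count occurrences, while 4) only reshapes, so `pivot` raises a `ValueError` as soon as a (name, event) pair appears more than once. A minimal sketch, assuming the three-row frame from the question (the variable names `dup`, `by_size` and so on are just for illustration):

```python
import pandas as pd

# Frame assumed from the question: three (name, event) rows.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

counts = pd.crosstab(df.name, df.event)
by_size = df.groupby(["name", "event"]).size().unstack(fill_value=0)
by_pivot = (
    df.assign(v=1)
      .pivot(index="name", columns="event", values="v")
      .fillna(0)
      .astype(int)  # pivot leaves NaN holes, so the result is float until cast
)

# Repeat every row once so each (name, event) pair occurs twice.
dup = pd.concat([df, df], ignore_index=True)
dup_counts = pd.crosstab(dup.name, dup.event)

try:
    dup.assign(v=1).pivot(index="name", columns="event", values="v")
    pivot_failed = False
except ValueError:
    # pivot cannot reshape when the (index, columns) pairs are not unique
    pivot_failed = True

print(dup_counts)
print("pivot failed on duplicates:", pivot_failed)
```

So if the real data can contain repeated pairs, one of the counting variants is probably the safer choice.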
Option 1
pir1
and pir1_5
df.set_index('name').event.str.get_dummies()
event_1 event_2
name
name_1 1 0
name_1 0 1
name_2 1 0
Then you could sum across the index
df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
event_1 event_2
name
name_1 1 1
name_2 1 0
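The heading mentions `pir1_5` as well; going by the Functions section further down, it swaps `Series.str.get_dummies` for `pd.get_dummies`. A small sketch comparing the two on the question's frame, written with `groupby(level=0).sum()` because `sum(level=0)` no longer exists in pandas 2.0 and later. The `dtype=int` argument is my addition, since newer pandas versions return boolean dummies by default:

```python
import pandas as pd

# Frame assumed from the question.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

events = df.set_index("name").event

# pir1 style: string method, which splits each value on a separator ('|' by default)
via_str = events.str.get_dummies().groupby(level=0).sum()

# pir1_5 style: treats each value as one category, no string splitting
via_pd = pd.get_dummies(events, dtype=int).groupby(level=0).sum()

print(via_str)
print(via_pd)
```

On this data both give the same counts; they would only differ if an event string contained the separator character.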
Option 2
pir2
Or you could dot product
pd.get_dummies(df.name).T.dot(pd.get_dummies(df.event))
event_1 event_2
name_1 1 1
name_2 1 0
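Why the dot product works: if `N` is the one-hot matrix of names (one row per record) and `E` the one-hot matrix of events, then entry (a, b) of `N.T @ E` adds up, over all records, 1 whenever the record has name a and event b, which is exactly the count. A sketch on the question's frame; `N` and `E` are just illustrative names, and `dtype=int` is passed on the assumption that you are on a pandas version where dummies default to booleans (a boolean product would not give counts):

```python
import pandas as pd

# Frame assumed from the question.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

N = pd.get_dummies(df.name, dtype=int)   # shape (records, names)
E = pd.get_dummies(df.event, dtype=int)  # shape (records, events)

# (names, records) dot (records, events) -> (names, events)
counts = N.T.dot(E)
print(counts)
```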
Option 3
pir3
Advanced Mode
i, r = pd.factorize(df.name.values)
j, c = pd.factorize(df.event.values)
n, m = r.size, c.size
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
pd.DataFrame(b, r, c)
event_1 event_2
name_1 1 1
name_2 1 0
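In case the `bincount` trick looks opaque, here is the same code with the intermediate values written out for the question's frame. `factorize` turns each column into integer codes, `i * m + j` maps every (name, event) pair to one cell of a flattened n-by-m grid, and `bincount` counts how often each cell is hit. The variable `flat` is only introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Frame assumed from the question.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

i, r = pd.factorize(df.name.values)   # i = [0, 0, 1], r = name_1, name_2
j, c = pd.factorize(df.event.values)  # j = [0, 1, 0], c = event_1, event_2
n, m = r.size, c.size                 # 2 names, 2 events

flat = i * m + j                      # [0, 1, 2]: row-major cell number per record
b = np.bincount(flat, minlength=n * m).reshape(n, m)

out = pd.DataFrame(b, index=r, columns=c)
print(out)
```

Note that `factorize` numbers labels in order of first appearance, so the rows and columns come out in that order rather than sorted.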
Timing
res.plot(loglog=True)
res.div(res.min(1), 0)
            pir1      pir2  pir3      john1     john2      john3
10      9.948396  3.399913   1.0  20.478368  4.460466  10.642113
30      9.350524  2.681178   1.0  16.589248  3.847666   9.168907
100    11.414536  3.079463   1.0  18.076040  4.277752   9.949305
300    15.769594  2.940529   1.0  16.745889  3.945470   9.069265
1000   26.869451  2.617564   1.0  12.789570  3.236390   7.279205
3000   42.229542  2.099541   1.0   8.716600  2.429847   4.785814
10000  52.571678  1.716088   1.0   4.597598  1.691989   2.800455
30000  58.644764  1.469827   1.0   2.818744  1.535012   1.929452
Functions
pir1 = lambda df: df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
pir1_5 = lambda df: pd.get_dummies(df.set_index('name').event).groupby(level=0).sum()
pir2 = lambda df: pd.get_dummies(df.name).T.dot(pd.get_dummies(df.event))
def pir3(df):
    i, r = pd.factorize(df.name.values)
    j, c = pd.factorize(df.event.values)
    n, m = r.size, c.size
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    return pd.DataFrame(b, r, c)
john1 = lambda df: pd.crosstab(df.name, df.event)
john2 = lambda df: df.groupby(['name', 'event']).size().unstack(fill_value=0)
john3 = lambda df: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
Test
from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 john1 john2 john3'.split(),
    dtype=float
)
for i in res.index:
    # Repeat the sample frame i times to get a larger input.
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
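The `from __main__ import ...` setup only works when everything lives at the top level of the running script or session. A possible variant, sketched here under my own assumptions (two of the functions only, smaller sizes, fewer repetitions, and a hypothetical `funcs` dict), passes the objects through the `globals` argument of `timeit` instead:

```python
from timeit import timeit

import pandas as pd

# Frame assumed from the question.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# Hypothetical subset of the functions being compared.
funcs = {
    "john1": lambda d: pd.crosstab(d.name, d.event),
    "john2": lambda d: d.groupby(["name", "event"]).size().unstack(fill_value=0),
}

sizes = [10, 100]
res = pd.DataFrame(index=sizes, columns=list(funcs), dtype=float)

for size in sizes:
    d = pd.concat([df] * size, ignore_index=True)
    for label, fn in funcs.items():
        # globals= hands fn and d to the timed statement directly
        res.at[size, label] = timeit("fn(d)", globals={"fn": fn, "d": d}, number=5)

# Each row divided by its fastest entry, as in the table above
relative = res.div(res.min(axis=1), axis=0)
print(relative)
```

The absolute numbers will differ by machine and pandas version, so treat any such table as a relative comparison only.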
You are asking for the Pythonic way. I think in Python that means using a technique called one-hot encoding, which is well implemented in libraries like sklearn. After one-hot encoding, you group the dataframe by the first column and apply the sum function.
Here is the code:
import pandas as pd #the useful libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer #from sklearn
dataset = pd.DataFrame([['name_1', 'event_1' ], ['name_1', 'event_2'], ['name_2', 'event_1']], columns=['name', 'event'], index=[1, 2, 3])
data = dataset['event'] #just reproduce your dataframe
enc = LabelBinarizer(neg_label=0)
dataset['event_2'] = enc.fit_transform(data).ravel() #two classes give one column; flatten it to 1-D
event_two = dataset['event_2']
dataset['event_1'] = (~event_two.astype(bool)).astype(np.int64) #this is a trick to reproduce the event_1 column
dataset = dataset.groupby('name').sum()
dataset.reset_index(inplace=True)
And the output is: