I was looking into pandas to do some simple calculations for NLP and text mining, but I couldn't quite grasp how to do them.
Suppose I have the following data frame,
This method should generalize fairly well:
In [100]: df
Out[100]:
  gender          name firstname    shingles
0      M      John Doe      John  [Joh, ohn]
1      F  Mary Poppins      Mary  [Mar, ary]
2      F      Jane Doe      Jane  [Jan, ane]
3      M   John Cusack      John  [Joh, ohn]
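If you want to reproduce this frame yourself, here is a minimal sketch. The values are retyped from the output above; the helper `char_shingles` is a hypothetical name of mine, and it assumes the shingles are simply the 3-character substrings of `firstname`, which matches the data shown.

```python
import pandas as pd

def char_shingles(text, k=3):
    # Hypothetical helper: all overlapping k-character substrings of text.
    return [text[i:i + k] for i in range(len(text) - k + 1)]

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "name": ["John Doe", "Mary Poppins", "Jane Doe", "John Cusack"],
    "firstname": ["John", "Mary", "Jane", "John"],
})

# One list of shingles per row, stored in an object column.
df["shingles"] = df["firstname"].apply(char_shingles)
```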
First, create an "expanded" series in which every entry is a single shingle. The index of this series is a MultiIndex: the first level is the shingle's position within its list, and the second level is the row index of the original DataFrame:
In [103]: s = df.shingles.apply(lambda x: pandas.Series(x)).unstack(); s
Out[103]:
0  0    Joh
   1    Mar
   2    Jan
   3    Joh
1  0    ohn
   1    ary
   2    ane
   3    ohn
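The same expansion as a standalone sketch. Note that I build the wide frame with `pd.DataFrame(df["shingles"].tolist(), index=df.index)` rather than `apply(pandas.Series)`; for equal-length lists like these the two should give the same wide frame, and the list-based constructor tends to be faster. The sample data is retyped from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "name": ["John Doe", "Mary Poppins", "Jane Doe", "John Cusack"],
    "firstname": ["John", "Mary", "Jane", "John"],
    "shingles": [["Joh", "ohn"], ["Mar", "ary"], ["Jan", "ane"], ["Joh", "ohn"]],
})

# One column per shingle position, one row per original row.
wide = pd.DataFrame(df["shingles"].tolist(), index=df.index)

# Unstacking a frame with a flat index yields a Series whose MultiIndex is
# (shingle position, original row index).
s = wide.unstack()
```

If the lists have different lengths, the shorter rows are padded with missing values, so you would probably want `s.dropna()` before going further.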
Next, join the new series back onto the original dataframe. First reset the index, dropping the shingle-position level, so that the series carries the original index with one entry per shingle. Joining it to the original dataframe produces:
In [106]: df2 = df.join(pandas.DataFrame(s.reset_index(level=0, drop=True))); df2
Out[106]:
  gender          name firstname    shingles    0
0      M      John Doe      John  [Joh, ohn]  Joh
0      M      John Doe      John  [Joh, ohn]  ohn
1      F  Mary Poppins      Mary  [Mar, ary]  Mar
1      F  Mary Poppins      Mary  [Mar, ary]  ary
2      F      Jane Doe      Jane  [Jan, ane]  Jan
2      F      Jane Doe      Jane  [Jan, ane]  ane
3      M   John Cusack      John  [Joh, ohn]  Joh
3      M   John Cusack      John  [Joh, ohn]  ohn
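On newer pandas (0.25 or later, as far as I know) the expand-and-join steps can be collapsed with `DataFrame.explode`. This is an alternative to the approach above, not the same code; the column name `shingle` is my own choice, used in place of the `0` column.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "name": ["John Doe", "Mary Poppins", "Jane Doe", "John Cusack"],
    "firstname": ["John", "Mary", "Jane", "John"],
    "shingles": [["Joh", "ohn"], ["Mar", "ary"], ["Jan", "ane"], ["Joh", "ohn"]],
})

# Copy the list column, then explode the copy: each list element becomes its
# own row, and the original index value is repeated for every element.
df2 = df.assign(shingle=df["shingles"]).explode("shingle")
```

The result has the same shape as `df2` above: the original list column is kept, and each row carries one scalar shingle.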
Finally, group by the shingle column (named 0), count the gender values within each group, unstack the resulting series, and fill the NaNs with zeros:
In [124]: df2.groupby(0, sort=False)['gender'].value_counts().unstack().fillna(0)
Out[124]:
     F  M
0
Joh  0  2
ohn  0  2
Mar  1  0
ary  1  0
Jan  1  0
ane  1  0
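For what it's worth, the same counts can also be obtained with `pandas.crosstab`, which fills missing combinations with zero for you. This is a swapped-in technique rather than the groupby shown above, and it sorts the shingles instead of keeping first-seen order. The sketch assumes the exploded frame with a `shingle` column (my naming) and pandas 0.25 or later for `explode`.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "firstname": ["John", "Mary", "Jane", "John"],
    "shingles": [["Joh", "ohn"], ["Mar", "ary"], ["Jan", "ane"], ["Joh", "ohn"]],
})

# One row per (person, shingle) pair.
long_df = df.assign(shingle=df["shingles"]).explode("shingle")

# Rows are shingles, columns are genders, cells are occurrence counts.
counts = pd.crosstab(long_df["shingle"], long_df["gender"])
```

Here `counts.loc["Joh", "M"]` gives the number of male rows containing the shingle `Joh`.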