I am trying to build a frequency table from a dataframe using pandas and Python. It is exactly the same problem as in a previous question of mine, which used R.
Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):
    node     | precedingWord
    -------------------------
    A-bom      de
    A-bom      die
    A-bom      de
    A-bom      een
    A-bom      n
    A-bom      de
    acroniem   het
    acroniem   t
    acroniem   het
    acroniem   n
    acroniem   een
    act        de
    act        het
    act        die
    act        dat
    act        t
    act        n
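For reproducibility, the sample above can be built roughly like this. This is only a sketch: the column names are taken from the table, and the variable name `df` is an assumption that matches the attempt further down.

```python
import pandas as pd

# Sample rows copied from the table above; the real dataframe is much larger.
rows = [
    ("A-bom", "de"), ("A-bom", "die"), ("A-bom", "de"),
    ("A-bom", "een"), ("A-bom", "n"), ("A-bom", "de"),
    ("acroniem", "het"), ("acroniem", "t"), ("acroniem", "het"),
    ("acroniem", "n"), ("acroniem", "een"),
    ("act", "de"), ("act", "het"), ("act", "die"),
    ("act", "dat"), ("act", "t"), ("act", "n"),
]
df = pd.DataFrame(rows, columns=["node", "precedingWord"])
```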
I'd like to use these values to count the `precedingWord` values per `node`, but grouped into subcategories: one column titled `neuter`, another `non-neuter`, and a last one `rest`. `neuter` would count all rows for which `precedingWord` is one of `t`, `het`, `dat`. `non-neuter` would count `de` and `die`, and `rest` would count everything that belongs in neither `neuter` nor `non-neuter`. (It would be nice if this could be dynamic, in other words that `rest` uses some sort of negated version of the variables used for `neuter` and `non-neuter`, or simply subtracts the `neuter` and `non-neuter` counts from the number of rows with that node.)
Example output (in a new dataframe, let's say `freqDf`) would look like this:
    node     | neuter | nonNeuter | rest
    -----------------------------------------
    A-bom      0        4           2
    acroniem   3        0           2
    act        3        2           1
I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. In my case, however, there are multiple rows with the same node, which should all be collapsed into a single row of frequencies, as shown in the expected output above.
I thought of something like this (untested):

    import pandas as pd

    def specificFreq(d):
        # d contains all rows that share one node (one groupby group)
        neuter = d['precedingWord'].isin(['t', 'het', 'dat']).sum()
        nonNeuter = d['precedingWord'].isin(['de', 'die']).sum()
        # Number of rows for this node, minus the neuter and non-neuter counts above
        rest = len(d) - neuter - nonNeuter
        return pd.Series({'neuter': neuter, 'nonNeuter': nonNeuter, 'rest': rest})

    df.groupby('node').apply(specificFreq)
But I highly doubt this is the correct way of doing something like this.
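One direction that might be less roundabout is to map every word to a category first and then cross-tabulate. The following is only a sketch under my own assumptions: the `categories` lookup and the helper names are made up for illustration, and `rest` is derived as "anything not listed", which is the dynamic behaviour described above.

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["A-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": ["de", "die", "de", "een", "n", "de",
                      "het", "t", "het", "n", "een",
                      "de", "het", "die", "dat", "t", "n"],
})

# Hypothetical lookup: only the explicit categories are listed here.
categories = {
    "neuter": ["t", "het", "dat"],
    "nonNeuter": ["de", "die"],
}
word_to_category = {w: cat for cat, words in categories.items() for w in words}

# Words missing from the lookup become NaN after map(), so "rest" is
# derived from the other categories instead of being listed itself.
category = df["precedingWord"].map(word_to_category).fillna("rest")

# Count rows per (node, category); reindex so every column exists,
# even if a category never occurs in the data.
freqDf = (
    pd.crosstab(df["node"], category)
    .reindex(columns=["neuter", "nonNeuter", "rest"], fill_value=0)
    .rename_axis(columns=None)
)
print(freqDf)
```

With the sample rows this gives the counts from the expected output (for example 0, 4, 2 for `A-bom`). Adding a category would only require a new entry in `categories`.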