Create advanced frequency table with Python

匿名 (未验证) 提交于 2019-12-03 09:14:57

问题:

I am trying to make a frequency table based on a dataframe with pandas and Python. In fact it's exactly the same as a previous question of mine which used R.

Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):

node    |   precedingWord ------------------------- A-bom       de A-bom       die A-bom       de A-bom       een A-bom       n A-bom       de acroniem    het acroniem    t acroniem    het acroniem    n acroniem    een act         de act         het act         die act         dat act         t act         n 

I'd like to use these values to make a count of the precedingWords per node, but with subcategories. For instance: one column to add values to that is titled neuter, another non-neuter and a last one rest. neuter would contain all values for which precedingWord is one of these values: t,het, dat. non-neuter would contain de and die, and rest would contain everything that doesn't belong into neuter or non-neuter. (It would be nice if this could be dynamic, in other words that rest uses some sort of reversed variable that is used for neuter and non-neuter. Or which simply subtracts the values in neuter and non-neuter from the length of rows with that node.)

Example output (in a new dataframe, let's say freqDf, would look like this:

node    |   neuter   | nonNeuter   | rest ----------------------------------------- A-bom       0          4             2 acroniem    3          0             2 act         3          2             1 

I found an answer to a similar question but the use case isn't exactly the same. It seems to me that in that question all variables are independent. However, in my case it is obvious that I have multiple rows with the same node, which should all be brought down to a single one frequency - as show in the expected output above.

I thought something like this (untested):

def specificFreq(d):       for uniqueWord in d['node']         return pd.Series({'node': uniqueWord ,             'neuter': sum(d['node' == uniqueWord] & d['precedingWord'] == 't|het|dat'),             'nonNeuter':  sum(d['node' == uniqueWord] & d['precedingWord'] == 'de|die'),             'rest': len(uniqueWord) - neuter - nonNeuter}) # Length of rows with the specific word, distracted by neuter and nonneuter values above  df.groupby('node').apply(specificFreq) 

But I highly doubt this the correct way of doing something like this.

回答1:

As proposed in the R solution, you can first change the name and then perform the cross tabulation:

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter" df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter" df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest" # neuter + non_neuter is the concatenation of both lists.  pd.crosstab(df.node, df.gender) gender    neuter  non_neuter  rest node                               A-bom          0           4     2 acroniem       3           0     2 act            3           2     1 

This one is better because if a word in neuter or non_neuter is not present in precedingword, it won't raise a KeyError like in the former solution.


Former solution, less clean.

Given your dataframe, you can make a simple cross tabulation:

ct = pd.crosstab(df.node, df.precedingWord)  

which gives:

pW        dat  de  die  een  het  n  t node                                   A-bom       0   3    1    1    0  1  0 acroniem    0   0    0    1    2  1  1 act         1   1    1    0    1  1  1 

Then, you just want to sum certain columns together:

neuter = ["t", "het", "dat"] non_neuter = ["de","die"] freqDf = pd.DataFrame()  freqDf["neuter"] = ct[neuter].sum(axis=1) ct.drop(neuter, axis=1, inplace=1)  freqDf["non_neuter"] = ct[non_neuter].sum(axis=1) ct.drop(non_neuter, axis=1, inplace=1)  freqDf["rest"] = ct.sum(axis=1) 

Which gives you freqDf:

          neuter  non_neuter  rest node                               A-bom          0           4     2 acroniem       3           0     2 act            3           2     1 

HTH



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!