Question
I have the following use case:
I want to make a dataframe where each row has a column showing how many interactions there have been for this ID (user) in the categories. The hardest part for me is that interactions must not be double counted, while a match in just one of the categories is enough to count as 1.
So for example I have:
richtingen id
0 Marketing, Sales 1110
1 Marketing, Sales 1110
2 Finance 220
3 Marketing, Engineering 1110
4 IT 3300
Now I want to create a third column where I can see how many times this ID has interacted with any of these categories in total. Each comma separates a category of its own, so for example "Marketing, Sales" means the two categories Marketing and Sales. To get a +1 you only need a match with another row where the ID is the same and one of the categories matches; so for index 0 the result would be 3 (indexes 0, 1 and 3 match). The output data for the example should be:
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
The hard part for me seems to be that I can't simply split all categories into new rows, as then you might start counting double. For example, index 0 matches both Marketing and Sales of index 1, and I want that to add just 1, not 2.
The code I have so far is:
df['freq'] = df.groupby(['id', 'richtingen'])['id'].transform('count')
this only matches identical combinations of categories though.
Other things I've tried:
- creating a new column with all the categories split into a list:
df['splitted'] = df.richtingen.apply(lambda x: str(x.split(",")))
and then the plan was to use something along the lines of this code, in combination with a groupby on id, to count the number of times it is true per item:
if any(t < 0 for t in x):
# do something
I couldn't get this to work either.
- I tried splitting the categories into new rows or columns, but then ran into the double-counting issue.
For example, using the suggested code:
df['richtingen'].str.split(', ',expand=True)
Gives me the following:
0 1 id
0 Marketing Sales 1110
1 Marketing Sales 1110
2 dDD None 220
3 Marketing Engineering 1110
4 ddsad None 3300
But then I would need code that goes over every row, checks the ID, lists the values in the columns, and checks whether any of them appears in another row with the same ID; if one of them matches, add 1 to freq. I suspect this might be possible with groupby, but I'm not sure and can't figure it out (a rough sketch of this brute-force idea is included further below).
- (Solution suggested by Jezrael below): if you need to count unique categories per id, first split, create a MultiIndex Series with stack, and finally use SeriesGroupBy.nunique with map for a new column of the original DataFrame.
I think this solution is close to what I need, but at the moment it counts the total number of unique categories per id (not the number of interactions with those categories). For example, the output at index 2 here is 2, while it should be 1 (as the user only interacted with the categories once).
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance, Accounting 220 2
3 Marketing, Engineering 1110 3
4 IT 3300 1
I hope I made myself clear and that someone knows how to fix this! In total there will be around 13 categories, always in one cell, separated by commas.
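To make the brute-force idea mentioned above concrete, here is a rough sketch of the matching I'm after (count_overlaps is just an illustrative helper; this will probably be too slow on the real data, which is why I'm hoping for a groupby-based approach):
import pandas as pd

df = pd.DataFrame({'richtingen': ['Marketing, Sales', 'Marketing, Sales', 'Finance',
                                  'Marketing, Engineering', 'IT'],
                   'id': [1110, 1110, 220, 1110, 3300]})

# Turn each cell into a set of categories.
cats = df['richtingen'].str.split(', ').apply(set)

def count_overlaps(i):
    # Count rows with the same id whose category set shares at least
    # one element with row i (the row itself included).
    same_id = df.index[df['id'] == df.loc[i, 'id']]
    return sum(bool(cats[i] & cats[j]) for j in same_id)

df['freq'] = [count_overlaps(i) for i in df.index]
print(df)
#                richtingen    id  freq
# 0        Marketing, Sales  1110     3
# 1        Marketing, Sales  1110     3
# 2                 Finance   220     1
# 3  Marketing, Engineering  1110     3
# 4                      IT  3300     1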
For msr_003:
id richtingen freq_x freq_y
0 220 Finance, IT 0 2
1 1110 Finance, IT 1 2
2 1110 Marketing, Sales 2 4
3 1110 Marketing, Sales 3 4
4 220 Marketing 4 1
5 220 Finance 5 2
6 1110 Marketing, Sales 6 4
7 3300 IT 7 1
8 1110 Marketing, IT 8 4
Answer 1:
If you need to count unique categories per id, first split, create a MultiIndex Series with stack, and last use SeriesGroupBy.nunique together with map for the new column of the original DataFrame:
# split the categories into one column per category, reshape to one
# category per row with stack, then count unique categories per id
s = (df.set_index('id')['richtingen']
       .str.split(', ', expand=True)
       .stack()
       .groupby(level=0)
       .nunique())
print (s)
id
220 1
1110 3
3300 1
dtype: int64
df['freq'] = df['id'].map(s)
print (df)
richtingen id freq
0 Marketing, Sales 1110 3
1 Marketing, Sales 1110 3
2 Finance 220 1
3 Marketing, Engineering 1110 3
4 IT 3300 1
Detail:
print (df.set_index('id')['richtingen'].str.split(', ',expand=True).stack())
id
1110 0 Marketing
1 Sales
0 Marketing
1 Sales
220 0 Finance
1110 0 Marketing
1 Engineering
3300 0 IT
dtype: object
Answer 2:
I just modified your code as below.
count_unique = pd.DataFrame(
    {'richtingen': ["Finance, IT", "Finance, IT", "Marketing, Sales", "Marketing, Sales", "Marketing",
                    "Finance", "Marketing, Sales", "IT", "Marketing, IT"],
     'id': [220, 1110, 1110, 1110, 220, 220, 1110, 3300, 1110]})
count_unique['freq'] = list(range(0, len(count_unique)))  # helper column for 'count' to aggregate
# count rows per exact (richtingen, id) combination, then attach it back to every row
grp = count_unique.groupby(['richtingen', 'id']).agg({'freq': 'count'}).reset_index(level=[0, 1])
pd.merge(count_unique, grp, on=('richtingen', 'id'), how='left')
Answer 3:
I am not that into pandas, but I think you may have some luck by adding 13 new columns based on richtingen, each column containing one category or nothing. You can use DataFrame.apply or a similar function to compute the values when creating the columns.
Then you can take it from there by OR-ing the columns...
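Not the answerer's own code, just a minimal sketch of that indicator-column idea, assuming pandas' str.get_dummies for the per-category columns and a matrix product as the OR-style match (sample data taken from the question):
import pandas as pd

df = pd.DataFrame({'richtingen': ['Marketing, Sales', 'Marketing, Sales', 'Finance',
                                  'Marketing, Engineering', 'IT'],
                   'id': [1110, 1110, 220, 1110, 3300]})

# One 0/1 indicator column per category (13 columns in the real data).
dummies = df['richtingen'].str.get_dummies(sep=', ')

# Two rows "match" if they have the same id AND share at least one category.
# dummies @ dummies.T counts shared categories, so > 0 is the OR over categories.
same_id = df['id'].values[:, None] == df['id'].values[None, :]
shares_cat = (dummies.values @ dummies.values.T) > 0

df['freq'] = (same_id & shares_cat).sum(axis=1)
print(df)  # freq becomes [3, 3, 1, 3, 1], matching the expected output in the question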
Source: https://stackoverflow.com/questions/51417159/pandas-dataframe-count-number-of-string-value-is-in-row-for-specific-id