How to perform a cumulative sum of distinct values in pandas dataframe

Posted by China☆狼群 on 2021-02-07 06:01:50

Question


I have a dataframe like this:

id    date         company    ......
123   2019-01-01        A
224   2019-01-01        B
345   2019-01-01        B
987   2019-01-03        C
334   2019-01-03        C
908   2019-01-04        C
765   2019-01-04        A
554   2019-01-05        A
482   2019-01-05        D

and I want to get the cumulative number of unique values over time for the 'company' column. So if a company appears again at a later date, it is not counted a second time.

My expected output is:

date            cumulative_count
2019-01-01      2
2019-01-03      3
2019-01-04      3
2019-01-05      4

I've tried:

df.groupby(['date']).company.nunique().cumsum()

but this double counts if the same company appears on a different date.
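
To see the double counting concretely, here is a small sketch that rebuilds the sample data and runs the attempted expression. The dates are kept as plain strings and the elided extra columns are left out, both of which are assumptions made only for illustration.

```python
import pandas as pd

# Sample data copied from the question (extra columns omitted)
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": list("ABBCCCAAD"),
})

# Distinct companies within each date, counted independently per date
per_date = df.groupby("date")["company"].nunique()
print(per_date.tolist())  # [2, 1, 2, 2]

# Summing those per-date counts re-counts A and C on later dates
naive = per_date.cumsum()
print(naive.tolist())     # [2, 3, 5, 7] instead of the wanted [2, 3, 3, 4]
```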


Answer 1:


Using duplicated + cumsum + last

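# True for every row whose company already appeared in an earlier row
m = df.duplicated('company')
d = df['date']

# Running count of first appearances, then the last value within each date
(~m).cumsum().groupby(d).last()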

date
2019-01-01    2
2019-01-03    3
2019-01-04    3
2019-01-05    4
dtype: int32
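
The duplicated + cumsum + last idea can be walked through step by step. The sketch below rebuilds the sample data with string dates (an assumption) and adds an explicit sort by date, since the running count only makes sense when rows are in chronological order; the variable names are illustrative only.

```python
import pandas as pd

# Sample data copied from the question (extra columns omitted)
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": list("ABBCCCAAD"),
})

# Make sure rows are in date order before counting first appearances
df = df.sort_values("date", kind="stable")

# True only on the first row where each company shows up
first_seen = ~df.duplicated("company")

# Running total of first appearances, row by row
running = first_seen.cumsum()
print(running.tolist())  # [1, 2, 2, 3, 3, 3, 3, 3, 4]

# The last running total on each date is the cumulative distinct count
result = running.groupby(df["date"]).last()
print(result.tolist())   # [2, 3, 3, 4]
```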



Answer 2:


Another way, trying to fix anky_91's approach:

(df.company.map(hash)).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
Out[196]: 
date
2019-01-01    2.0
2019-01-03    3.0
2019-01-04    3.0
2019-01-05    4.0
Name: company, dtype: float64

The original from anky_91:

(df.company.astype('category').cat.codes).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
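
Both variants turn the company names into numbers first because expanding().apply only works on numeric data. As a hedged alternative sketch (not part of either answer), pd.factorize gives small integer codes, which sidesteps any worry that large hash values could lose precision once the window is handed to the lambda as floats. The sample data is rebuilt with string dates as an assumption.

```python
import pandas as pd

# Sample data copied from the question (extra columns omitted)
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": list("ABBCCCAAD"),
})

# One small integer per distinct company, in order of first appearance
codes = pd.Series(pd.factorize(df["company"])[0], index=df.index)

# For each row, count the distinct codes seen from the top down to that row
distinct_so_far = codes.expanding().apply(lambda x: len(set(x)), raw=True)

# The largest count on each date is the cumulative distinct count
result = distinct_so_far.groupby(df["date"]).max().astype(int)
print(result.tolist())  # [2, 3, 3, 4]
```

Note that the expanding window rebuilds a set for every row, so this family of solutions does quadratic work and is likely to be slow on large frames.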



Answer 3:


This takes more code than anky's answer, but still works for the sample data:

df = df.sort_values('date')
(df.drop_duplicates(['company'])
   .groupby('date')
   .size().cumsum()
   .reindex(df['date'].unique())
   .ffill()
)

Output:

date
2019-01-01    2.0
2019-01-03    3.0
2019-01-04    3.0
2019-01-05    4.0
dtype: float64
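
The float output comes from reindex introducing NaN for dates with no new company (2019-01-04 here) before ffill fills them in. A small sketch, assuming string dates and using a deliberately shuffled copy of the sample data, shows that the sort makes the approach independent of the input row order and that a final astype(int) restores integer counts:

```python
import pandas as pd

# Sample data copied from the question (extra columns omitted)
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": list("ABBCCCAAD"),
})

# Shuffle the rows to mimic unsorted input, then sort by date again
shuffled = df.sample(frac=1, random_state=0)
ordered = shuffled.sort_values("date")

result = (
    ordered.drop_duplicates("company")    # keep each company's earliest row
           .groupby("date").size()        # new companies per date
           .cumsum()                      # running total of new companies
           .reindex(ordered["date"].unique())  # bring back dates with no new company
           .ffill()                       # carry the previous total forward
           .astype(int)                   # NaN is gone, so integers are safe
)
print(result.tolist())  # [2, 3, 3, 4]
```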


Source: https://stackoverflow.com/questions/57807505/how-to-perform-a-cumulative-sum-of-distinct-values-in-pandas-dataframe
