Question
I have a dataframe like this:
id date company ......
123 2019-01-01 A
224 2019-01-01 B
345 2019-01-01 B
987 2019-01-03 C
334 2019-01-03 C
908 2019-01-04 C
765 2019-01-04 A
554 2019-01-05 A
482 2019-01-05 D
and I want to get the cumulative number of unique values over time for the 'company' column, so a company that appears again at a later date is not counted again.
My expected output is:
date cumulative_count
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
I've tried:
df.groupby(['date']).company.nunique().cumsum()
but this double counts if the same company appears on a different date.
Answer 1:
Using duplicated + cumsum + last:
m = df.duplicated('company')
d = df['date']
(~m).cumsum().groupby(d).last()
date
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
dtype: int32
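For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same three steps. The names first_seen and result are mine, and it assumes the rows are already sorted by date, as they are in the sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01", "2019-01-01", "2019-01-01",
             "2019-01-03", "2019-01-03",
             "2019-01-04", "2019-01-04",
             "2019-01-05", "2019-01-05"],
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

# True only on the first row where each company appears.
first_seen = ~df.duplicated("company")

# Running count of first appearances, then the last value within each date.
result = first_seen.cumsum().groupby(df["date"]).last()
print(result)
```

If the frame is not sorted by date, sorting it first (df.sort_values('date')) should be enough for the running count to line up with the dates.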
Answer 2:
Another way, an attempt at fixing anky_91's answer:
(df.company.map(hash)).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
Out[196]:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
Name: company, dtype: float64
The version from anky_91:
(df.company.astype('category').cat.codes).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
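Both one-liners rebuild a set over an ever-growing window, so it may help to see what they are counting. A rough plain-Python equivalent, assuming the rows are sorted by date (the names seen and counts are only for illustration):

```python
import pandas as pd

# Sample data copied from the question (id column omitted, it is not needed).
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-01",
             "2019-01-03", "2019-01-03",
             "2019-01-04", "2019-01-04",
             "2019-01-05", "2019-01-05"],
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

seen = set()   # companies encountered so far
counts = {}    # date -> number of distinct companies seen up to that date

for date, company in zip(df["date"], df["company"]):
    seen.add(company)
    # Later rows of the same date overwrite earlier ones,
    # so each date ends up with its final running count.
    counts[date] = len(seen)

result = pd.Series(counts, name="cumulative_count")
print(result)
```

This walks the rows once and keeps a single set, instead of building a new set for every expanding window.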
Answer 3:
This takes more code than anky's answer, but still works for the sample data:
df = df.sort_values('date')
(df.drop_duplicates(['company'])
   .groupby('date')
   .size().cumsum()
   .reindex(df['date'].unique())
   .ffill()
)
Output:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
dtype: float64
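The float dtype comes from the NaN that reindex introduces for dates with no new company (2019-01-04 here) before ffill fills it in. If integer counts are preferred, one option is to cast at the end. A sketch on the sample data, with the name result being my own:

```python
import pandas as pd

# Sample data copied from the question (id column omitted, it is not needed).
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-01",
             "2019-01-03", "2019-01-03",
             "2019-01-04", "2019-01-04",
             "2019-01-05", "2019-01-05"],
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

df = df.sort_values("date")

result = (
    df.drop_duplicates("company")      # keep the first row of each company
      .groupby("date")
      .size()                          # new companies per date
      .cumsum()                        # running total of new companies
      .reindex(df["date"].unique())    # bring back dates with no new company (NaN)
      .ffill()                         # carry the previous total forward
      .astype(int)                     # safe here: no NaN remains after ffill
)
print(result)
```

The cast works because the earliest date always contains at least one first appearance, so ffill leaves no NaN behind.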
Source: https://stackoverflow.com/questions/57807505/how-to-perform-a-cumulative-sum-of-distinct-values-in-pandas-dataframe