filter multi-indexed grouped pandas dataframe

最后都变了 - submitted on 2020-05-24 03:31:03

Question


The data looks like the following:

id  timestamp   date        value
1   2001-01-01  2001-05-01  0
1   2001-10-01  2001-05-01  1
2   2001-01-01  2001-05-01  0
2   2001-10-01  2001-05-01  0

As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies somewhere between the first and the last timestamp of each id.

The task is to filter the table so that every id is removed unless it has at least one entry with value > 0 at a timestamp after its individual date.

I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new boolean value telling me whether the id is "good". Finally I keep only the rows where good is True.

Unfortunately this implementation is extremely slow for a big (>10M rows) dataset. I am looking for a way to speed this up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work and I do not know whether that would be the fastest option.

Working Code Example:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})

# Note: pandas parses these strings month-first, so '01-10-2001' becomes 2001-01-10.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id', 'timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for _, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)  # the index is now the timestamp only
    # Slice from the id's date onwards (inclusive) and check for any value > 0.
    df.loc[index, 'good'] = (df_id.loc[df_id['date'].iloc[0]:, 'value'] > 0).any()
df = df[df['good']]
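
The groupby idea mentioned above can be sketched with GroupBy.filter, which keeps or drops whole groups based on a boolean returned per group. This is only a minimal sketch: the helper name has_positive_after_date is made up for illustration, the dates are written in ISO format to avoid the month-first parsing ambiguity, and the comparison is strict (timestamp > date), whereas the label slice in the loop above also includes a timestamp equal to date. It still calls a Python function once per id, so it is not expected to be the fastest option on 10M rows.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2001-01-01', '2001-01-10',
                                 '2001-01-01', '2001-01-10']),
    'date': pd.to_datetime(['2001-01-05'] * 4),
    'value': [0, 1, 0, 0],
})

def has_positive_after_date(g):
    # Hypothetical helper: True if the group has at least one row
    # with value > 0 at a timestamp strictly after its date.
    return bool(((g['value'] > 0) & (g['timestamp'] > g['date'])).any())

# GroupBy.filter keeps all rows of every group for which the function is True.
filtered = df.groupby('id').filter(has_positive_after_date)
print(filtered)
```

Here only id 1 survives, because id 2 never has a positive value.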

Answer 1:


To keep every id that has at least one row where value is greater than 0 and timestamp is later than date, create two masks with Series.gt, chain them with & (bitwise AND), and then test whether each group contains at least one True by combining GroupBy.transform with 'any':

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])

m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print(df)
   id  timestamp       date  value
0   1 2001-01-01 2001-01-05      0
1   1 2001-01-10 2001-01-05      1
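
As a possible variation on the same mask, the group-wise transform can be swapped for a lookup of the matching ids with Series.isin. The sketch below is self-contained; the function name keep_good_ids is an assumption for illustration and is not part of the original answer, and whether it is faster than transform('any') on a given dataset would need to be measured.

```python
import pandas as pd

def keep_good_ids(df):
    """Keep all rows of every id that has value > 0 strictly after its date."""
    # Row-level mask: positive value at a timestamp later than the date.
    hit = df['value'].gt(0) & df['timestamp'].gt(df['date'])
    # Ids with at least one matching row.
    good_ids = df.loc[hit, 'id'].unique()
    # Keep every row (matching or not) that belongs to one of those ids.
    return df[df['id'].isin(good_ids)]

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2001-01-01', '2001-01-10',
                                 '2001-01-01', '2001-01-10']),
    'date': pd.to_datetime(['2001-01-05'] * 4),
    'value': [0, 1, 0, 0],
})

result = keep_good_ids(df)
print(result)
```

Both variants keep the non-matching rows of a good id as well (here the first row of id 1), which is what the task asks for.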


Source: https://stackoverflow.com/questions/61074421/filter-multi-indexed-grouped-pandas-dataframe
