Question
The data looks like the following:
id timestamp date value
1 2001-01-01 2001-05-01 0
1 2001-10-01 2001-05-01 1
2 2001-01-01 2001-05-01 0
2 2001-10-01 2001-05-01 0
As you can see, the table contains the columns id, timestamp, date and value.
Every row with the same id also has the same date.
Furthermore, date always lies somewhere between the first and the last timestamp of each id.
The task is to filter the table so that every id is removed which does not contain at least one entry with value > 0 at a timestamp after its individual date.
I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table where good is True.
Unfortunately this implementation is slow like hell for a big (>10M rows) dataset.
I am looking for a way to speed this up. My idea was using groupby.apply(lambda g: something)
but I did not get it to work and I do not know if this is the fastest option possible.
Working Code Example:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 2],
'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
'value': [0, 1, 0, 0]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id','timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for id_, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)
    # True if any value > 0 occurs at a timestamp from this id's date onwards
    df.loc[index, 'good'] = (df_id['value'].loc[df_id['date'].iloc[0]:] > 0).any()
df = df[df['good']]
Answer 1:
To keep all ids that have at least one row where value is greater than 0 and timestamp is later than date, create two masks with Series.gt, chain them with & for bitwise AND, and then test whether there is at least one True per group with GroupBy.any through GroupBy.transform:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print (df)
id timestamp date value
0 1 2001-01-01 2001-01-05 0
1 1 2001-01-10 2001-01-05 1
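As a possible variation on the approach above, here is a minimal sketch that builds the same row-level mask and then selects the qualifying ids with Series.isin instead of GroupBy.transform. The isin variant is not part of the original answer, and the names good_ids, via_transform and via_isin are made up for illustration. Whether it is faster on a large table is something to measure on your own data.

```python
import pandas as pd

# Same toy data as in the question (month-first date strings)
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(['01-01-2001', '01-10-2001',
                                 '01-01-2001', '01-10-2001']),
    'date': pd.to_datetime(['01-05-2001'] * 4),
    'value': [0, 1, 0, 0],
})

# Row-level mask: positive value at a timestamp strictly after the id's date
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])

# Approach from the answer: broadcast "any True in the group" to every row
via_transform = df[m.groupby(df['id']).transform('any')]

# Alternative (assumption, not from the answer): collect the ids that have
# at least one qualifying row, then keep every row belonging to those ids
good_ids = df.loc[m, 'id'].unique()
via_isin = df[df['id'].isin(good_ids)]

print(m.tolist())
print(via_isin)
```

On this sample both selections keep the two rows of id 1 and drop id 2.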
Source: https://stackoverflow.com/questions/61074421/filter-multi-indexed-grouped-pandas-dataframe