Question
I'm analysing an Apache log file and I have imported it into a pandas DataFrame.
'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
My dataframe:
I want to group this into sessions based on IP, Agent, and time difference (if the gap between consecutive requests is greater than 30 minutes, a new session should start).
It is easy to group the DataFrame by IP and Agent, but how do I check this time difference? I hope the problem is clear.
sessions = df.groupby(['IP', 'Agent']).size()
UPDATE: df.index is as follows:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-30 06:00:41, ..., 2013-05-30 22:29:14]
Length: 31975, Freq: None, Timezone: None
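The question doesn't show how the log line was parsed into the DataFrame. A minimal sketch for the Apache combined log format, using a regex with named groups (the column names IP, Time, Agent, etc. are assumptions, chosen to match the question's groupby):

```python
import re
import pandas as pd

# The sample line from the question.
line = ('65.55.52.118 - - [30/May/2013:06:58:52 -0600] '
        '"GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 '
        '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"')

# One named group per field of the combined log format.
pattern = (r'(?P<IP>\S+) \S+ \S+ \[(?P<Time>[^\]]+)\] '
           r'"(?P<Request>[^"]*)" (?P<Status>\d+) (?P<Size>\S+) '
           r'"(?P<Referer>[^"]*)" "(?P<Agent>[^"]*)"')

row = re.match(pattern, line).groupdict()
# Apache's timestamp format, including the UTC offset.
row['Time'] = pd.to_datetime(row['Time'], format='%d/%b/%Y:%H:%M:%S %z')
```

Applying `re.match` line by line (or `Series.str.extract` with the same pattern) and feeding the dicts to `pd.DataFrame` gives columns ready for the IP/Agent groupby below.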
Answer 1:
I would do this using a shift and a cumsum (here's a simple example with numbers instead of times, but timestamps would work exactly the same):
In [11]: s = pd.Series([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])
In [12]: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False) # *
Out[12]:
0 0
1 0
2 0
3 1
4 1
5 2
6 2
dtype: int64
* the need for skipna=False appears to be a bug.
Then you can use this in a groupby apply:
In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T
In [22]: df.columns = ['time', 'ip']
In [23]: df
Out[23]:
time ip
0 1.1 A
1 1.7 A
2 2.5 A
3 2.6 B
4 2.7 B
5 3.4 B
In [24]: g = df.groupby('ip')
In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))
In [26]: df
Out[26]:
time ip session_number
0 1.1 A 0
1 1.7 A 1
2 2.5 A 2
3 2.6 B 0
4 2.7 B 0
5 3.4 B 1
Now you can groupby 'ip' and 'session_number' (and analyse each session).
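The same shift/cumsum trick carries over to the question's real timestamps. A sketch assuming a 'Time' column parsed to datetimes and the 30-minute cutoff from the question (the column names and sample data are assumptions); note that with `Series.diff` the first value is NaT, which compares as False, so no fillna is needed:

```python
import pandas as pd

# Toy log data: two IP/Agent pairs, one of which has a >30-minute gap.
df = pd.DataFrame({
    'IP': ['65.55.52.118'] * 3 + ['10.0.0.1'] * 2,
    'Agent': ['bingbot'] * 3 + ['curl'] * 2,
    'Time': pd.to_datetime(['2013-05-30 06:00:41', '2013-05-30 06:10:00',
                            '2013-05-30 07:15:00', '2013-05-30 06:05:00',
                            '2013-05-30 06:20:00']),
})

df = df.sort_values('Time')
gap = pd.Timedelta(minutes=30)

# Within each (IP, Agent) group, start a new session whenever the gap
# to the previous request exceeds 30 minutes.
df['session_number'] = (df.groupby(['IP', 'Agent'])['Time']
                          .transform(lambda s: (s.diff() > gap).cumsum()))
```

`transform` returns a result aligned to the original rows, so the session numbers land in the right place even though the frame was sorted.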
Answer 2:
Andy Hayden's answer is lovely and concise, but it gets very slow if you have a large number of users/IP addresses to group over. Here's another method that's much uglier but also much faster.
import pandas as pd
import numpy as np
sample = lambda x: np.random.choice(x, size=10000)
df = pd.DataFrame({'ip': sample(range(500)),
                   'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})
max_diff = 0.5 # Max time difference
def method_1(df):
    df = df.sort_values('time')
    g = df.groupby('ip')
    df['session'] = g['time'].apply(
        lambda s: (s - s.shift(1) > max_diff).fillna(0).cumsum(skipna=False)
    )
    return df['session']
def method_2(df):
    # Sort by ip, then time
    df = df.sort_values(['ip', 'time'])
    # Get locations where the ip changes
    ip_change = df.ip != df.ip.shift()
    time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
    df['session'] = time_or_ip_change.cumsum()
    # The cumsum operated over the whole series, so subtract out the first
    # value for each IP
    df['tmp'] = 0
    df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
    df['tmp'] = np.maximum.accumulate(df.tmp)
    df['session'] = df.session - df.tmp
    # Delete the temporary column
    del df['tmp']
    return df['session']
r1 = method_1(df)
r2 = method_2(df)
assert (r1.sort_index() == r2.sort_index()).all()
%timeit method_1(df)
%timeit method_2(df)
400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
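Whichever method produced the session ids, the analysis itself is then an ordinary groupby over (ip, session). A small usage sketch with made-up data (the column names match the benchmark above, the values are invented for illustration):

```python
import pandas as pd

# Rows already carry a per-IP session id, as produced by either method.
df = pd.DataFrame({'ip': [1, 1, 1, 2, 2],
                   'time': [1.0, 1.1, 2.0, 1.0, 1.2],
                   'session': [0, 0, 1, 0, 0]})

# Per-session statistics: first/last hit and number of requests.
stats = df.groupby(['ip', 'session'])['time'].agg(['min', 'max', 'count'])
stats['duration'] = stats['max'] - stats['min']
```

Each (ip, session) pair becomes one row of `stats`, so session lengths and request counts fall out directly.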
Source: https://stackoverflow.com/questions/17547391/session-generation-from-log-file-analysis-with-pandas