I have the table below in a pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron
If I understand you correctly:
For (1) do this:
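Make some fake data by sampling from the values you gave, along with some random dates and visit counts. (Series, DataFrame and Timestamp come from pandas and poisson from numpy.random; now is assumed to be a helper that returns the current datetime.)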
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
In [180]: visits = Series(poisson(1000, size=100), name='visits')
In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
                  date   string  visits
0  2001-11-15 00:00:00  current     997
1  2001-11-15 00:00:00  current     974
2  2012-10-02 00:00:00     stem     982
3  2001-12-01 00:00:00     stem     984
4  2001-01-01 00:00:00  current     989
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016
The NaN values are there because there were no visits with that query string in those months.
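For readers on a recent pandas, here is a minimal sketch of step (1). The how= keyword used above belongs to an older pandas API that newer releases no longer accept, so this version groups on calendar-month periods instead of calling resample. That choice, the fixed seed, and the names rng, strings, dates, month and monthly are assumptions for illustration, not part of the original session; now() is left out of the date pool so the data is repeatable. Only months that actually contain visits appear in the result, so there are no NaN rows to drop afterwards.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the fake data is repeatable
n = 100

# Query strings and dates taken from the session above.
strings = np.array(["current", "molecular", "neuron", "nucleus", "stem"])
dates = pd.to_datetime(
    ["2012-10-02", "2001-01-01", "2001-11-15", "2001-12-01", "2001-05-01"]
)

df = pd.DataFrame({
    "string": rng.choice(strings, size=n),
    "visits": rng.poisson(1000, size=n),
    # Pick random positions, then index into the DatetimeIndex.
    "date": dates[rng.integers(0, len(dates), size=n)],
})

# Total visits per query string per calendar month.
month = df["date"].dt.to_period("M").rename("month")
monthly = df.groupby(["string", month])["visits"].sum()
print(monthly.head())
```

The result is a Series indexed by (string, month). If month-end timestamps are preferred over periods, monthly.index.levels can be converted later; the grouping itself does not depend on that.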
For (2), group by the dates and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188
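Just to convince you that (2) is doing what you want, the shares within a single month should sum to 1 (this check comes from an earlier run of the random data, so the numbers differ from those above):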
In [176]: h = g.sortlevel('date').head()
In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
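The same share calculation can be sketched on a small hand-built frame. The visit totals below are copied from the 2001-01-31 and 2001-05-31 rows of the resamp output in this answer; the names rows, share and check are made up for the example. It uses groupby(...).transform("sum") rather than apply with a lambda, which is an alternative spelling of the same idea, not the method used in the session.

```python
import pandas as pd

# Monthly totals copied from the resamp output shown in this answer.
rows = [
    ("current", "2001-01-31", 2996), ("current", "2001-05-31", 3016),
    ("molecular", "2001-01-31", 3984), ("molecular", "2001-05-31", 1911),
    ("neuron", "2001-01-31", 3961), ("neuron", "2001-05-31", 2069),
    ("nucleus", "2001-01-31", 3060), ("nucleus", "2001-05-31", 3035),
    ("stem", "2001-01-31", 2911), ("stem", "2001-05-31", 5994),
]
resamp = (
    pd.DataFrame(rows, columns=["string", "date", "visits"])
    .assign(date=lambda d: pd.to_datetime(d["date"]))
    .set_index(["string", "date"])
)

# Divide each row by the total for its month.
month_total = resamp.groupby(level="date").transform("sum")
share = resamp / month_total

# Sanity check: the shares within each month add up to 1.
check = share.groupby(level="date").sum()
print(share.round(3))
print(check)
```

The rounded shares for current (0.177 and 0.188) agree with the g output above. On recent pandas, sort_index(level='date') plays the role that sortlevel('date') plays in the session.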
If you want to convert resamp into a flat DataFrame and remove the NaNs, do:
In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
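As a compact illustration of what dropna() followed by reset_index() does, here is a sketch on a three-row frame. The values are the first current rows from the resamp output above; the name flat is an assumption for the example.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["current"], pd.to_datetime(["2001-01-31", "2001-02-28", "2001-05-31"])],
    names=["string", "date"],
)
resamp = pd.DataFrame({"visits": [2996, np.nan, 3016]}, index=idx)

# dropna() removes rows containing NaN; reset_index() turns the
# index levels 'string' and 'date' back into ordinary columns.
flat = resamp.dropna().reset_index()
print(flat)
```

The visits column stays floating point after the NaN rows are dropped; flat["visits"].astype(int) converts it back if integer counts are wanted.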
You can of course do this for g as well:
In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Lastly, if you want to put your columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
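For comparison, plain column selection gives the same ordering as reindex(columns=...) when every listed column exists. The sketch below uses two rows copied from the g output above; the names flat, order, by_reindex and by_selection are made up for the example.

```python
import pandas as pd

flat = pd.DataFrame({
    "string": ["current", "molecular"],
    "date": pd.to_datetime(["2001-01-31", "2001-01-31"]),
    "visits": [0.177, 0.236],
})

order = ["visits", "string", "date"]
by_reindex = flat.reindex(columns=order)  # as in the session above
by_selection = flat[order]                # plain column selection

print(by_reindex)
print(by_reindex.equals(by_selection))
```

The two differ only when a listed name is missing: reindex creates that column filled with NaN, while selection raises a KeyError, which can be the safer behaviour if a typo in a column name should not go unnoticed.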