Grouping daily data by month in Python/pandas and then normalizing

予麋鹿 2021-02-19 03:37

I have the table below in a Pandas DataFrame:

    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron              


        
1 answer
  • 2021-02-19 03:52

    If I understand you correctly:

    For (1) do this:

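    Make some fake data by sampling from the query strings you gave, plus some random dates and visit counts (the session assumes Series, DataFrame and Timestamp were imported from pandas and poisson from numpy.random; now is presumably a helper returning the current timestamp):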

    In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
    
    In [180]: visits = Series(poisson(1000, size=100), name='visits')
    
    In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
    
    In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
    
    In [183]: df.head()
    Out[183]:
                     date   string  visits
    0 2001-11-15 00:00:00  current     997
    1 2001-11-15 00:00:00  current     974
    2 2012-10-02 00:00:00     stem     982
    3 2001-12-01 00:00:00     stem     984
    4 2001-01-01 00:00:00  current     989
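
    The transcript above depends on names imported earlier in the session and on the question's original frame. A self-contained sketch of the same idea is given below; the list of query strings, the candidate dates and the fixed seed are assumptions made for illustration, not values taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the fake data is reproducible

# Hypothetical stand-ins for the question's query strings and dates.
terms = ["nucleus", "neuron", "current", "molecular", "stem"]
dates = pd.to_datetime(
    ["2001-01-01", "2001-05-01", "2001-11-15", "2001-12-01", "2012-10-02"]
)

n = 100
df = pd.DataFrame({
    "string": rng.choice(terms, size=n),              # random query string per row
    "visits": rng.poisson(1000, size=n),              # visit counts around 1000
    "date": dates[rng.integers(0, len(dates), size=n)],  # random date per row
})

print(df.head())
```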
    
    In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
    
    In [187]: resamp.head()
    Out[187]:
                        visits
    string  date
    current 2001-01-31    2996
            2001-02-28     NaN
            2001-03-31     NaN
            2001-04-30     NaN
            2001-05-31    3016
    

    NaN is there because there were no visits with that query string in those months.
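
    The how='sum' keyword above comes from an older pandas release; as far as I know, newer releases drop it in favour of calling the aggregation as a method. A minimal sketch of the same monthly totals that avoids resample altogether is shown below. The toy data and the names month, monthly and monthly_full are made up for illustration. Note one difference: it pads every string over the overall range of months, whereas the resample above works within each string's own date range.

```python
import pandas as pd

df = pd.DataFrame({
    "string": ["stem", "stem", "neuron", "stem"],
    "visits": [10, 5, 7, 3],
    "date": pd.to_datetime(
        ["2001-01-03", "2001-01-20", "2001-01-05", "2001-03-09"]
    ),
})

# Label each row with its calendar month, then total visits per (string, month).
month = df["date"].dt.to_period("M").rename("month")
monthly = df.groupby(["string", month])["visits"].sum()

# Only observed (string, month) pairs exist so far. Reindexing onto the full
# string x month grid exposes the months without visits as NaN.
all_months = pd.period_range(month.min(), month.max(), freq="M")
full_index = pd.MultiIndex.from_product(
    [monthly.index.levels[0], all_months], names=["string", "month"]
)
monthly_full = monthly.reindex(full_index)

print(monthly_full)
```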

    For (2), group by the dates and then divide each value by that date's sum:

    In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
    
    In [189]: g.head()
    Out[189]:
                        visits
    string  date
    current 2001-01-31   0.177
            2001-02-28     NaN
            2001-03-31     NaN
            2001-04-30     NaN
            2001-05-31   0.188
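
    The same normalization can also be written with transform, which returns the per-date totals aligned to the original index, so the division is a plain element-wise operation. This is a sketch on invented numbers; I believe it is the safer form on recent pandas versions, where apply may add the group key as an extra index level.

```python
import pandas as pd

monthly = pd.Series(
    [30.0, 10.0, 60.0, 40.0],
    index=pd.MultiIndex.from_tuples(
        [
            ("neuron", "2001-01"),
            ("neuron", "2001-05"),
            ("stem", "2001-01"),
            ("stem", "2001-05"),
        ],
        names=["string", "date"],
    ),
    name="visits",
)

# Total visits per date, broadcast back to every row of that date.
date_totals = monthly.groupby(level="date").transform("sum")

# Each string's share of the visits within its own date.
share = monthly / date_totals

print(share)
```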
    

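    Just to convince you that (2) is doing what you want (this check comes from an earlier run with different random data, as the prompt numbers show, so the values differ from those above):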

    In [176]: h = g.sortlevel('date').head()
    
    In [177]: h
    Out[177]:
                          visits
    string    date
    current   2001-01-31   0.077
    molecular 2001-01-31   0.228
    neuron    2001-01-31   0.073
    nucleus   2001-01-31   0.234
    stem      2001-01-31   0.388
    
    In [178]: h.sum()
    Out[178]:
    visits    1
    dtype: float64
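
    The check above only looks at the first five rows, which all belong to one date. A small sketch that checks every date at once follows; the frame is invented for illustration, and sort_index(level=...) is used as the current spelling of the older sortlevel method.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["neuron", "stem"], ["2001-01-31", "2001-05-31"]],
    names=["string", "date"],
)
g = pd.DataFrame({"visits": [0.25, 0.2, 0.75, 0.8]}, index=idx)

# Order rows by date so that each date's rows sit next to each other.
by_date = g.sort_index(level="date")

# Sum the shares within every date; each total should be 1 up to rounding.
totals = g.groupby(level="date")["visits"].sum()
all_ok = bool(np.allclose(totals, 1.0))

print(by_date)
print(totals)
```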
    

    If you want to remove the NaNs from resamp and turn its index levels back into ordinary columns, do:

    In [196]: resamp.dropna()
    Out[196]:
                          visits
    string    date
    current   2001-01-31    2996
              2001-05-31    3016
              2001-11-30    5959
              2001-12-31    3998
              2013-09-30    1077
    molecular 2001-01-31    3984
              2001-05-31    1911
              2001-11-30    3054
              2001-12-31    1020
              2012-10-31     977
              2013-09-30    1947
    neuron    2001-01-31    3961
              2001-05-31    2069
              2001-11-30    5010
              2001-12-31    2065
              2012-10-31    6973
              2013-09-30     994
    nucleus   2001-01-31    3060
              2001-05-31    3035
              2001-11-30    2924
              2001-12-31    4144
              2012-10-31    2004
              2013-09-30    7881
    stem      2001-01-31    2911
              2001-05-31    5994
              2001-11-30    6072
              2001-12-31    4916
              2012-10-31    1991
              2013-09-30    3977
    
    In [197]: resamp.dropna().reset_index()
    Out[197]:
           string                date  visits
    0     current 2001-01-31 00:00:00    2996
    1     current 2001-05-31 00:00:00    3016
    2     current 2001-11-30 00:00:00    5959
    3     current 2001-12-31 00:00:00    3998
    4     current 2013-09-30 00:00:00    1077
    5   molecular 2001-01-31 00:00:00    3984
    6   molecular 2001-05-31 00:00:00    1911
    7   molecular 2001-11-30 00:00:00    3054
    8   molecular 2001-12-31 00:00:00    1020
    9   molecular 2012-10-31 00:00:00     977
    10  molecular 2013-09-30 00:00:00    1947
    11     neuron 2001-01-31 00:00:00    3961
    12     neuron 2001-05-31 00:00:00    2069
    13     neuron 2001-11-30 00:00:00    5010
    14     neuron 2001-12-31 00:00:00    2065
    15     neuron 2012-10-31 00:00:00    6973
    16     neuron 2013-09-30 00:00:00     994
    17    nucleus 2001-01-31 00:00:00    3060
    18    nucleus 2001-05-31 00:00:00    3035
    19    nucleus 2001-11-30 00:00:00    2924
    20    nucleus 2001-12-31 00:00:00    4144
    21    nucleus 2012-10-31 00:00:00    2004
    22    nucleus 2013-09-30 00:00:00    7881
    23       stem 2001-01-31 00:00:00    2911
    24       stem 2001-05-31 00:00:00    5994
    25       stem 2001-11-30 00:00:00    6072
    26       stem 2001-12-31 00:00:00    4916
    27       stem 2012-10-31 00:00:00    1991
    28       stem 2013-09-30 00:00:00    3977
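
    As a compact illustration of what the two calls do, here is a sketch on a four-row frame (values borrowed from the output above, with one NaN row kept in): dropna removes the rows that contain NaN, and reset_index moves the string and date index levels into regular columns with a fresh integer index.

```python
import numpy as np
import pandas as pd

resamp = pd.DataFrame(
    {"visits": [2996.0, np.nan, 3016.0, 3984.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("current", pd.Timestamp("2001-01-31")),
            ("current", pd.Timestamp("2001-02-28")),
            ("current", pd.Timestamp("2001-05-31")),
            ("molecular", pd.Timestamp("2001-01-31")),
        ],
        names=["string", "date"],
    ),
)

# Drop the month without visits, then flatten the two index levels into columns.
flat = resamp.dropna().reset_index()

print(flat)
```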
    

    You can of course do this for g as well:

    In [198]: g.dropna()
    Out[198]:
                          visits
    string    date
    current   2001-01-31   0.177
              2001-05-31   0.188
              2001-11-30   0.259
              2001-12-31   0.248
              2013-09-30   0.068
    molecular 2001-01-31   0.236
              2001-05-31   0.119
              2001-11-30   0.133
              2001-12-31   0.063
              2012-10-31   0.082
              2013-09-30   0.123
    neuron    2001-01-31   0.234
              2001-05-31   0.129
              2001-11-30   0.218
              2001-12-31   0.128
              2012-10-31   0.584
              2013-09-30   0.063
    nucleus   2001-01-31   0.181
              2001-05-31   0.189
              2001-11-30   0.127
              2001-12-31   0.257
              2012-10-31   0.168
              2013-09-30   0.496
    stem      2001-01-31   0.172
              2001-05-31   0.374
              2001-11-30   0.264
              2001-12-31   0.305
              2012-10-31   0.167
              2013-09-30   0.251
    
    In [199]: g.dropna().reset_index()
    Out[199]:
           string                date  visits
    0     current 2001-01-31 00:00:00   0.177
    1     current 2001-05-31 00:00:00   0.188
    2     current 2001-11-30 00:00:00   0.259
    3     current 2001-12-31 00:00:00   0.248
    4     current 2013-09-30 00:00:00   0.068
    5   molecular 2001-01-31 00:00:00   0.236
    6   molecular 2001-05-31 00:00:00   0.119
    7   molecular 2001-11-30 00:00:00   0.133
    8   molecular 2001-12-31 00:00:00   0.063
    9   molecular 2012-10-31 00:00:00   0.082
    10  molecular 2013-09-30 00:00:00   0.123
    11     neuron 2001-01-31 00:00:00   0.234
    12     neuron 2001-05-31 00:00:00   0.129
    13     neuron 2001-11-30 00:00:00   0.218
    14     neuron 2001-12-31 00:00:00   0.128
    15     neuron 2012-10-31 00:00:00   0.584
    16     neuron 2013-09-30 00:00:00   0.063
    17    nucleus 2001-01-31 00:00:00   0.181
    18    nucleus 2001-05-31 00:00:00   0.189
    19    nucleus 2001-11-30 00:00:00   0.127
    20    nucleus 2001-12-31 00:00:00   0.257
    21    nucleus 2012-10-31 00:00:00   0.168
    22    nucleus 2013-09-30 00:00:00   0.496
    23       stem 2001-01-31 00:00:00   0.172
    24       stem 2001-05-31 00:00:00   0.374
    25       stem 2001-11-30 00:00:00   0.264
    26       stem 2001-12-31 00:00:00   0.305
    27       stem 2012-10-31 00:00:00   0.167
    28       stem 2013-09-30 00:00:00   0.251
    

    Lastly, if you want to put your columns in a different order, use reindex:

    In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
    Out[210]:
        visits     string                date
    0    0.177    current 2001-01-31 00:00:00
    1    0.188    current 2001-05-31 00:00:00
    2    0.259    current 2001-11-30 00:00:00
    3    0.248    current 2001-12-31 00:00:00
    4    0.068    current 2013-09-30 00:00:00
    5    0.236  molecular 2001-01-31 00:00:00
    6    0.119  molecular 2001-05-31 00:00:00
    7    0.133  molecular 2001-11-30 00:00:00
    8    0.063  molecular 2001-12-31 00:00:00
    9    0.082  molecular 2012-10-31 00:00:00
    10   0.123  molecular 2013-09-30 00:00:00
    11   0.234     neuron 2001-01-31 00:00:00
    12   0.129     neuron 2001-05-31 00:00:00
    13   0.218     neuron 2001-11-30 00:00:00
    14   0.128     neuron 2001-12-31 00:00:00
    15   0.584     neuron 2012-10-31 00:00:00
    16   0.063     neuron 2013-09-30 00:00:00
    17   0.181    nucleus 2001-01-31 00:00:00
    18   0.189    nucleus 2001-05-31 00:00:00
    19   0.127    nucleus 2001-11-30 00:00:00
    20   0.257    nucleus 2001-12-31 00:00:00
    21   0.168    nucleus 2012-10-31 00:00:00
    22   0.496    nucleus 2013-09-30 00:00:00
    23   0.172       stem 2001-01-31 00:00:00
    24   0.374       stem 2001-05-31 00:00:00
    25   0.264       stem 2001-11-30 00:00:00
    26   0.305       stem 2001-12-31 00:00:00
    27   0.167       stem 2012-10-31 00:00:00
    28   0.251       stem 2013-09-30 00:00:00
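
    Selecting the columns with a list gives the same ordering as reindex. The sketch below, on a made-up two-row frame, shows both forms and the one practical difference I am aware of: reindex silently creates an all-NaN column for a misspelled name, while list selection would raise a KeyError.

```python
import pandas as pd

flat = pd.DataFrame({
    "string": ["current", "stem"],
    "date": pd.to_datetime(["2001-01-31", "2001-01-31"]),
    "visits": [0.177, 0.172],
})

order = ["visits", "string", "date"]

by_reindex = flat.reindex(columns=order)  # reorder via reindex
by_selection = flat[order]                # reorder via list selection

# A misspelled name ("visit") becomes an all-NaN column under reindex.
typo = flat.reindex(columns=["visit", "string"])

print(by_reindex)
print(typo)
```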
    