python split a pandas data frame by week or month and group the data based on these sp

南笙 2021-01-03 12:57
DateOccurred  CostCentre  TimeDifference
03/09/2012    2073        28138
03/09/2012    6078        34844
03/09/2012    8273        31215
03/09/2012    8367        28160
03/09/2012    8959        32037


        
2 Answers
  • 2021-01-03 13:19

    Perhaps group by CostCentre first, then use Series/DataFrame resample()?

    In [72]: centers = {}
    
    In [73]: for center, group in df.groupby("CostCentre"):
       ....:     # resample() needs a datetime index, so index each group by date
       ....:     timediff = group.set_index("Date")['TimeDifference']
       ....:     centers[center] = timediff.resample("W").sum()
    
    In [77]: pd.concat(centers, names=['CostCentre'])
    Out[77]: 
    CostCentre  Date      
    0           2012-09-09         0
                2012-09-16     89522
                2012-09-23         6
                2012-09-30       161
    2073        2012-09-09    141208
                2012-09-16    113024
                2012-09-23    169599
                2012-09-30    170780
    6078        2012-09-09    171481
                2012-09-16    160871
                2012-09-23    153976
                2012-09-30    122972
    

    Additional details:

    Passing parse_dates=True to the pd.read_* functions parses the index as dates, so index_col must also be set.

    In [28]: df = pd.read_clipboard(sep=' +', parse_dates=True, index_col=0,
       ....:                        dayfirst=True)
    
    In [30]: df.head()
    Out[30]: 
                  CostCentre  TimeDifference
    DateOccurred                            
    2012-09-03          2073           28138
    2012-09-03          6078           34844
    2012-09-03          8273           31215
    2012-09-03          8367           28160
    2012-09-03          8959           32037
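
The same options can be tried without a clipboard. The following is a minimal sketch that assumes the data is available as comma-separated text; the `raw` and `sample` names and the CSV layout are illustrative, while the rows are the first three from the question.

```python
import io

import pandas as pd

# Assumed CSV rendering of the first rows of the question's data.
raw = io.StringIO(
    "DateOccurred,CostCentre,TimeDifference\n"
    "03/09/2012,2073,28138\n"
    "03/09/2012,6078,34844\n"
    "03/09/2012,8273,31215\n"
)

# index_col=0 makes DateOccurred the index, parse_dates=True parses that
# index as dates, and dayfirst=True reads 03/09/2012 as 3 September.
sample = pd.read_csv(raw, index_col=0, parse_dates=True, dayfirst=True)

print(type(sample.index).__name__)
print(sample)
```

Without `dayfirst=True`, a date such as 03/09/2012 would be read month-first, which would silently shift the weekly buckets.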
    

    Since resample() requires a frame or series with a DatetimeIndex, setting the index when the frame is created removes the need to set it for each group individually. GroupBy objects also have an apply method, which is essentially syntactic sugar for the "combine" step done with pd.concat() above.

    In [37]: x = df.groupby("CostCentre").apply(lambda g:
       ....:         g['TimeDifference'].resample("W").sum())
    
    In [38]: x.head(12)
    Out[38]: 
    CostCentre  DateOccurred
    0           2012-09-09           0
                2012-09-16       89522
                2012-09-23           6
                2012-09-30         161
    2073        2012-09-09      141208
                2012-09-16      113024
                2012-09-23      169599
                2012-09-30      170780
    6078        2012-09-09      171481
                2012-09-16      160871
                2012-09-23      153976
                2012-09-30      122972
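
Another way to get per-centre weekly or monthly totals is to group on the cost centre and a time bucket in a single call. This is a sketch rather than the method used above: it relies on `pd.Grouper`, and apart from the first two rows the dates and values in the frame are made up for illustration.

```python
import pandas as pd

# Illustrative data: the first two rows come from the question,
# the remaining rows are invented so that several weeks and months appear.
df = pd.DataFrame(
    {
        "CostCentre": [2073, 6078, 2073, 6078, 2073],
        "TimeDifference": [28138, 34844, 100, 200, 300],
    },
    index=pd.to_datetime(
        ["2012-09-03", "2012-09-03", "2012-09-05", "2012-09-12", "2012-10-01"]
    ),
)
df.index.name = "DateOccurred"

# With no key given, pd.Grouper buckets the DatetimeIndex itself.
# "W" labels each week by its closing Sunday.
weekly = df.groupby(["CostCentre", pd.Grouper(freq="W")])["TimeDifference"].sum()

# "MS" labels each month by its first day.
monthly = df.groupby(["CostCentre", pd.Grouper(freq="MS")])["TimeDifference"].sum()

print(weekly)
print(monthly)
```

Both results are Series with a (CostCentre, DateOccurred) MultiIndex, the same shape as the output shown above.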
    
  • 2021-01-03 13:34

    Here's a way to take your input (as text) and group it the way you want. The key is to use a dictionary for each grouping (date, then centre).

    import collections
    import datetime
    import functools
    
    def delta_totals_by_date_and_centre(in_file):
        # Use a defaultdict instead of a normal dict so that missing values are
        # automatically created. by_date is a mapping (dict) from a tuple of (year, week)
        # to another mapping (dict) from centre to total delta time.
        by_date = collections.defaultdict(functools.partial(collections.defaultdict, int))
    
        # For each line in the input...
        for line in in_file:
            # Parse the three fields of each line into date, int, int.
            date, centre, delta = line.split()
            date = datetime.datetime.strptime(date, "%d/%m/%Y").date()
            centre = int(centre)
            delta = int(delta)
    
            # Determine the year and week of the year.
            year, week, weekday = date.isocalendar()
            year_and_week = year, week
    
            # Add the time delta.
            by_date[year_and_week][centre] += delta
    
        # Yield each result, in order.
        for year_and_week, by_centre in sorted(by_date.items()):
            for centre, delta in sorted(by_centre.items()):
                yield year_and_week, centre, delta
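
To try the idea on the question's sample, a condensed variant can be fed an in-memory file. This is a sketch under stated assumptions: `weekly_totals` is a hypothetical name, and skipping the header row and any incomplete rows is an assumption about how such lines should be handled.

```python
import collections
import datetime
import io


def weekly_totals(lines):
    """Sum deltas per (ISO year, ISO week) and centre, as in the function above."""
    totals = collections.defaultdict(lambda: collections.defaultdict(int))
    for line in lines:
        fields = line.split()
        # Assumption: skip the header row and rows that lack a value.
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        date = datetime.datetime.strptime(fields[0], "%d/%m/%Y").date()
        year, week, _weekday = date.isocalendar()
        totals[(year, week)][int(fields[1])] += int(fields[2])

    # Yield the results sorted by week, then by centre.
    for year_and_week, by_centre in sorted(totals.items()):
        for centre, delta in sorted(by_centre.items()):
            yield year_and_week, centre, delta


# The first rows of the question's data, header included.
sample = io.StringIO(
    "DateOccurred    CostCentre  TimeDifference\n"
    "03/09/2012  2073    28138\n"
    "03/09/2012  6078    34844\n"
    "03/09/2012  8273    31215\n"
    "03/09/2012  8367    28160\n"
    "03/09/2012  8959    32037\n"
)

rows = list(weekly_totals(sample))
for (year, week), centre, delta in rows:
    print(f"{year}-{week:02d} {centre:5d} {delta:6d}")
```

Because the function only iterates over lines, an open file object works in place of the `io.StringIO` buffer.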
    

    For your sample input, it produces the following totals (the first column is the ISO year and ISO week number, written as year-week).

    2012-36     0      0
    2012-36  2073 141208
    2012-36  6078 171481
    2012-36  7042  27129
    2012-36  7569 124600
    2012-36  8239  82153
    2012-36  8273 154517
    2012-36  8367 113339
    2012-36  8959  82770
    2012-36  9292 128089
    2012-36  9532 137491
    2012-36  9705 146321
    2012-36 10085 151483
    2012-36 10220  87496
    2012-36 14573    186
    2012-37     0  89522
    2012-37  2073 113024
    2012-37  6078 160871
    2012-37  7042  35063
    2012-37  7097  30866
    2012-37  8239  61744
    2012-37  8273 153898
    2012-37  8367  93564
    2012-37  8959 116727
    2012-37  9292 132628
    2012-37  9532 121462
    2012-37  9705 139992
    2012-37 10085 111229
    2012-37 10220  91245
    2012-38     0      6
    2012-38  2073 169599
    2012-38  6078 153976
    2012-38  7097  34909
    2012-38  7569 152958
    2012-38  8239 122693
    2012-38  8273 119536
    2012-38  8367 116157
    2012-38  8959  75579
    2012-38  9292 128340
    2012-38  9532 163278
    2012-38  9705  95205
    2012-38 10085  94284
    2012-38 10220  92318
    2012-38 14573    468
    2012-39     0    161
    2012-39  2073 170780
    2012-39  6078 122972
    2012-39  7042  34953
    2012-39  7097  63475
    2012-39  7569  92371
    2012-39  8239 194048
    2012-39  8273 123332
    2012-39  8367 115365
    2012-39  8959 104609
    2012-39  9292 131369
    2012-39  9532 143933
    2012-39  9705 123107
    2012-39 10085 129276
    2012-39 10220 124681
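
These ISO week numbers line up with the week-ending dates in the pandas answer: week 36 of 2012 closes on Sunday 2012-09-09, where centre 2073 shows the same total of 141208. A small sketch of the conversion, assuming Python 3.8 or later for `date.fromisocalendar` (the `week_ending` helper is a hypothetical name):

```python
import datetime


def week_ending(year, week):
    """Return the Sunday that closes the given ISO week."""
    # ISO weekdays run from 1 (Monday) to 7 (Sunday).
    return datetime.date.fromisocalendar(year, week, 7)


print(week_ending(2012, 36))  # 2012-09-09
print(week_ending(2012, 39))  # 2012-09-30
```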
    