pandas mean calculation over a column in a csv

Question

I have some data in a csv file, as shown below (only partial data is shown here).

SourceID   local_date   local_time  Vge    BSs              PC         hour Type
7208       8/01/2015    11:00:19    15.4    87             +BC_MSG      11  MAIN
11060      8/01/2015    11:01:56    14.9    67             +AB_MSG      11  MAIN
3737       8/01/2015    11:02:09    15.4    88             +AB_MSG      11  MAIN
9683       8/01/2015    11:07:19    14.9    69             +AB_MSG      11  MAIN
9276       8/01/2015    11:07:52    15.4    88             +AB_MSG      11  MAIN
7754       8/01/2015    11:09:26    14.7    62             +AF_MSG      11  MAIN
11111      8/01/2015    11:10:06    15.2    80             +AF_MSG      11  MAIN
9276       8/01/2015    11:10:52    15.4    88             +AB_MSG      11  MAIN
11111      8/01/2015    11:12:56    15.2    80             +AB_MSG      11  MAIN
6148       8/01/2015    11:15:29    15      70             +AB_MSG      11  MAIN
11111      8/01/2015    11:15:56    15.2    80             +AB_MSG      11  MAIN
9866       8/01/2015    11:16:28    4.102   80             +SUB_MSG     11  SUB
9866       8/01/2015    11:16:38    15.1    78             +AH_MSG      11  MAIN
9866       8/01/2015    11:16:38    4.086   78             +SUB_MSG     11  SUB
20729      8/01/2015    11:23:21    11.6    82             +AB_MSG      11  MAIN
9276       8/01/2015    11:25:52    15.4    88             +AB_MSG      11  MAIN
11111      8/01/2015    11:34:16    15.2    80             +AF_MSG      11  MAIN
20190      8/01/2015    11:36:09    11.2    55             +AF_MSG      11  MAIN
7208       8/01/2015    11:37:09    15.3    85             +AB_MSG      11  MAIN
7208       8/01/2015    11:38:39    15.3    86             +AB_MSG      11  MAIN
7754       8/01/2015    11:39:16    14.7    61             +AB_MSG      11  MAIN
8968       8/01/2015    11:39:39    15.5    91             +AB_MSG      11  MAIN
3737       8/01/2015    11:41:09    15.4    88             +AB_MSG      11  MAIN
9683       8/01/2015    11:41:39    14.9    69             +AF_MSG      11  MAIN
20729      8/01/2015    11:44:36    11.6    81             +AB_MSG      11  MAIN
9704       8/01/2015    11:45:20    14.9    68             +AF_MSG      11  MAIN
11111      8/01/2015    11:46:06    4.111   87             +SUB_MSG     11  PAN

I have the following Python program that uses pandas to process this input:

import sys
import csv
import operator
import os
from glob import glob
import fileinput
from relativeDates import *
import datetime
import math
import pprint
import numpy as np
import pandas as pd
from io import StringIO

COLLECTION = 'NEW'
BATTERY = r'C:\MyFolder\Analysis\\{}'.format(COLLECTION)
# BATTERY is the base folder defined above (the original used an undefined name here)
INPUT_FILE = BATTERY + r'\in.csv'
OUTPUT_FILE = BATTERY + r'\out.csv'


with open(INPUT_FILE) as fin:
    df = pd.read_csv(INPUT_FILE,
                  usecols=["SourceID", "local_date","local_time","Vge",'BSs','PC'],
                  header=0)


    #df.set_index(['SourceID','local_date','local_time','Vge','BSs','PC'],inplace=True)
    df.drop_duplicates(inplace=True)
    #df.reset_index(inplace=True)

    hour_list = []
    gb = df['local_time'].groupby(df['local_date'])
    for i in list(gb)[0][1]:
             hour_list.append(i.split(':')[0])
    for j in list(gb)[1][1]:
            hour_list.append(str(int(j.split(':')[0])+ 24))

    df['hour'] = pd.Series(hour_list,index=df.index)


    df.set_index(['SourceID','local_date','local_time','Vge'],inplace=True)

    #gb = df['hour'].groupby(df['PC'])
    #print(list(gb))
    gb = df['PC']
    class_list = []
    for msg in df['PC']:
        if 'SUB' in msg:
            class_list.append('SUB')
        else:
            class_list.append('MAIN')

    df['Type'] = pd.Series(class_list,index=df.index)


    print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean))
    gb = df['Type'].groupby(df['hour'])
    #print(list(gb))

    #print(list(df.groupby(['hour','Type']).count()))

    df.to_csv(OUTPUT_FILE)

I want to get the average of the BSs field over time. This is what I am attempting with print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above.
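For reference, the grouped mean described above can be reproduced on a few rows without the file paths. This is only a minimal sketch: the four rows are copied from the listing, the hour is taken from the time string as an integer, and the +24 offset that the program applies to the second date is left out.

```python
import pandas as pd
from io import StringIO

# Four rows copied from the listing above, written as comma-separated text.
SAMPLE = """SourceID,local_date,local_time,Vge,BSs,PC
7208,8/01/2015,11:00:19,15.4,87,+BC_MSG
11060,8/01/2015,11:01:56,14.9,67,+AB_MSG
9866,8/01/2015,11:16:28,4.102,80,+SUB_MSG
9866,8/01/2015,11:16:38,4.086,78,+SUB_MSG
"""

df = pd.read_csv(StringIO(SAMPLE))

# Hour of day taken from the HH:MM:SS string (assumption: single date only).
df["hour"] = df["local_time"].str.split(":").str[0].astype(int)

# 'SUB' when the PC code contains 'SUB', otherwise 'MAIN'.
df["Type"] = df["PC"].str.contains("SUB").map({True: "SUB", False: "MAIN"})

# Plain (unweighted) mean of BSs per hour and Type.
mean_bss = df.groupby(["hour", "Type"])["BSs"].mean()
print(mean_bss)
```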

However, a few things need to be considered:

Vge values can be classified into 2 types based on the Type field.
The number of Vge values that we get can vary widely from hour to hour.
The whole data set covers 24 hours.
The Vge values can be received from a number of SourceIDs.
The Vge values can vary a little among SourceIDs but should be somewhat similar during the same time interval (same hour).
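The points above can be checked directly by counting samples and sources per hour and Type. A rough sketch, assuming the hour and Type columns already exist; the rows below are made up for illustration and only resemble the listing.

```python
import pandas as pd

# Made-up rows for illustration; 'hour' and 'Type' are assumed to be present.
df = pd.DataFrame({
    "SourceID": [7208, 7208, 11060, 9866, 9866, 7208],
    "hour":     [11, 11, 11, 11, 11, 12],
    "Type":     ["MAIN", "MAIN", "MAIN", "SUB", "SUB", "MAIN"],
    "Vge":      [15.4, 15.3, 14.9, 4.102, 4.086, 15.3],
})

# Per hour and Type: how many samples, how many distinct sources,
# and how much the Vge values spread.
summary = df.groupby(["hour", "Type"]).agg(
    n_samples=("Vge", "size"),
    n_sources=("SourceID", "nunique"),
    vge_mean=("Vge", "mean"),
    vge_std=("Vge", "std"),
)
print(summary)
```

A group with a single sample has no defined standard deviation, so vge_std is NaN there.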

In such a situation, calculating a simple mean as in print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above won't be sufficient, as the number of samples differs between time periods (hours).
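To make the concern concrete, here is a small sketch with made-up numbers comparing a pooled mean, where every sample counts equally, with a mean of hourly means, where every hour counts equally. Which of the two is appropriate depends on what the average is meant to represent, and that is an assumption either way.

```python
import numpy as np
import pandas as pd

# Made-up numbers: hour 11 has four samples, hour 12 has only one.
df = pd.DataFrame({
    "hour": [11, 11, 11, 11, 12],
    "Type": ["MAIN"] * 5,
    "BSs":  [80, 80, 80, 80, 60],
})

# Mean and sample count for each (Type, hour) cell.
per_hour = df.groupby(["Type", "hour"])["BSs"].agg(["mean", "count"])

# Pooled mean: every sample counts equally, so busy hours dominate.
pooled = df.groupby("Type")["BSs"].mean()

# Mean of hourly means: every hour counts equally, whatever its sample count.
hourly_equal = per_hour["mean"].groupby(level="Type").mean()

# Weighting the hourly means by their counts recovers the pooled mean.
main = per_hour.xs("MAIN", level="Type")
reweighted = np.average(main["mean"], weights=main["count"])

print(pooled["MAIN"], hourly_equal["MAIN"], reweighted)
```

With these numbers the pooled mean is 76.0 and the mean of hourly means is 70.0, so the sparse hour pulls the second figure down much more strongly.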

What function should be used in such a situation?

Source: https://stackoverflow.com/questions/33792915/pandas-mean-calculation-over-a-column-in-a-csv

Tags

python

python-3.x

numpy

pandas

data-analysis