Question
I have some data in a CSV file, as shown below (only partial data is shown here).
SourceID local_date local_time Vge BSs PC hour Type
7208 8/01/2015 11:00:19 15.4 87 +BC_MSG 11 MAIN
11060 8/01/2015 11:01:56 14.9 67 +AB_MSG 11 MAIN
3737 8/01/2015 11:02:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:07:19 14.9 69 +AB_MSG 11 MAIN
9276 8/01/2015 11:07:52 15.4 88 +AB_MSG 11 MAIN
7754 8/01/2015 11:09:26 14.7 62 +AF_MSG 11 MAIN
11111 8/01/2015 11:10:06 15.2 80 +AF_MSG 11 MAIN
9276 8/01/2015 11:10:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:12:56 15.2 80 +AB_MSG 11 MAIN
6148 8/01/2015 11:15:29 15 70 +AB_MSG 11 MAIN
11111 8/01/2015 11:15:56 15.2 80 +AB_MSG 11 MAIN
9866 8/01/2015 11:16:28 4.102 80 +SUB_MSG 11 SUB
9866 8/01/2015 11:16:38 15.1 78 +AH_MSG 11 MAIN
9866 8/01/2015 11:16:38 4.086 78 +SUB_MSG 11 SUB
20729 8/01/2015 11:23:21 11.6 82 +AB_MSG 11 MAIN
9276 8/01/2015 11:25:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:34:16 15.2 80 +AF_MSG 11 MAIN
20190 8/01/2015 11:36:09 11.2 55 +AF_MSG 11 MAIN
7208 8/01/2015 11:37:09 15.3 85 +AB_MSG 11 MAIN
7208 8/01/2015 11:38:39 15.3 86 +AB_MSG 11 MAIN
7754 8/01/2015 11:39:16 14.7 61 +AB_MSG 11 MAIN
8968 8/01/2015 11:39:39 15.5 91 +AB_MSG 11 MAIN
3737 8/01/2015 11:41:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:41:39 14.9 69 +AF_MSG 11 MAIN
20729 8/01/2015 11:44:36 11.6 81 +AB_MSG 11 MAIN
9704 8/01/2015 11:45:20 14.9 68 +AF_MSG 11 MAIN
11111 8/01/2015 11:46:06 4.111 87 +SUB_MSG 11 PAN
I have the following Python program that uses pandas to process this input:
import sys
import csv
import operator
import os
from glob import glob
import fileinput
from relativeDates import *
import datetime
import math
import pprint
import numpy as np
import pandas as pd
from io import StringIO
COLLECTION = 'NEW'
BATTERY = r'C:\MyFolder\Analysis\\{}'.format(COLLECTION)
# Build both paths from BATTERY (the name `Pandas` was never defined)
INPUT_FILE = BATTERY + r'\in.csv'
OUTPUT_FILE = BATTERY + r'\out.csv'
with open(INPUT_FILE) as fin:
    df = pd.read_csv(INPUT_FILE,
                     usecols=["SourceID", "local_date","local_time","Vge",'BSs','PC'],
                     header=0)
#df.set_index(['SourceID','local_date','local_time','Vge','BSs','PC'],inplace=True)
df.drop_duplicates(inplace=True)
#df.reset_index(inplace=True)
hour_list = []
gb = df['local_time'].groupby(df['local_date'])
for i in list(gb)[0][1]:
    hour_list.append(i.split(':')[0])
for j in list(gb)[1][1]:
    hour_list.append(str(int(j.split(':')[0])+ 24))
df['hour'] = pd.Series(hour_list,index=df.index)
df.set_index(['SourceID','local_date','local_time','Vge'],inplace=True)
#gb = df['hour'].groupby(df['PC'])
#print(list(gb))
gb = df['PC']
class_list = []
for msg in df['PC']:
    if 'SUB' in msg:
        class_list.append('SUB')
    else:
        class_list.append('MAIN')
df['Type'] = pd.Series(class_list,index=df.index)
print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean))
gb = df['Type'].groupby(df['hour'])
#print(list(gb))
#print(list(df.groupby(['hour','Type']).count()))
df.to_csv(OUTPUT_FILE)
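For comparison, the hour and Type columns built by the loops above could also be derived without explicit loops. This is only a sketch under assumptions: the sample rows are made up, the rows are assumed to appear in chronological order (so the first date seen is day 0), and hour is stored as an integer rather than a string.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same layout as the CSV above.
df = pd.DataFrame({
    "local_date": ["8/01/2015", "8/01/2015", "9/01/2015"],
    "local_time": ["11:16:28", "11:16:38", "00:05:10"],
    "PC": ["+SUB_MSG", "+AH_MSG", "+AB_MSG"],
})

# Day offset: 0 for the first date seen, 1 for the next, and so on
# (assumes the file is sorted chronologically).
day_index, _ = pd.factorize(df["local_date"])

# Hour of day taken from the "HH:MM:SS" string, shifted by 24 per elapsed day.
hour_of_day = df["local_time"].str.split(":").str[0].astype(int)
df["hour"] = hour_of_day + 24 * day_index

# 'SUB' when the PC code contains "SUB", otherwise 'MAIN'.
df["Type"] = np.where(df["PC"].str.contains("SUB", regex=False), "SUB", "MAIN")

print(df)
```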
I want to get an average of the BSs field over time. This is what I am attempting to do with print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above.
However, a few things need to be considered:
- Vge values can be classified into 2 types based on the Type field.
- The number of Vge values that we get can vary widely from hour to hour.
- The whole data set is for 24 hours.
- The Vge values can be received from a number of SourceIDs.
- The Vge values can vary a little among SourceIDs but should be somewhat similar during the same time interval (same hour).
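Since the sample count differs from hour to hour, it may help to look at the count next to each mean. A minimal sketch, assuming a frame that already has the hour and Type columns; the rows and the output column names (mean_BSs, n_samples, n_sources) are made up for illustration:

```python
import pandas as pd

# Hypothetical rows; hour 11 has more samples than hour 12.
df = pd.DataFrame({
    "hour": [11, 11, 11, 12],
    "Type": ["MAIN", "MAIN", "SUB", "MAIN"],
    "BSs": [80, 90, 78, 60],
    "SourceID": [7208, 9276, 9866, 7208],
})

# Per (hour, Type): the mean, how many samples it is based on,
# and how many distinct sources contributed.
summary = df.groupby(["hour", "Type"]).agg(
    mean_BSs=("BSs", "mean"),
    n_samples=("BSs", "count"),
    n_sources=("SourceID", "nunique"),
)

print(summary)
```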
In such a situation, calculating a simple mean as above with print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) won't be sufficient, because the number of samples differs between time periods (hours).
What function should be used in such a situation?
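To make the concern concrete, the sketch below compares a pooled mean (every sample counts equally, so busy hours dominate) with a mean of hourly means (every hour counts equally). The rows are made up, and treating each hour with equal weight is only one possible reading of what is wanted here, not an established answer.

```python
import pandas as pd

# Hypothetical rows: three samples in hour 11, one in hour 12.
df = pd.DataFrame({
    "hour": [11, 11, 11, 12],
    "Type": ["MAIN", "MAIN", "MAIN", "MAIN"],
    "BSs": [80, 90, 70, 60],
})

# Pooled mean per Type: each sample has the same weight.
pooled = df.groupby("Type")["BSs"].mean()

# Mean per (Type, hour) first, then average those hourly means per Type,
# so each hour has the same weight regardless of its sample count.
hourly = df.groupby(["Type", "hour"])["BSs"].mean()
balanced = hourly.groupby(level="Type").mean()

print(pooled)
print(balanced)
```

With these rows the pooled mean for MAIN is 75, while the mean of hourly means is 70, because hour 12 contributes a single low sample.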
Source: https://stackoverflow.com/questions/33792915/pandas-mean-calculation-over-a-column-in-a-csv