Question
I want to calculate the average for each hour from a CSV file.
Below is my data set:
Timestamp Temperature
9/1/2016 0:00:08 53.8
9/1/2016 0:00:38 53.8
9/1/2016 0:01:08 53.8
9/1/2016 0:01:38 53.8
9/1/2016 0:02:08 53.8
9/1/2016 0:02:38 54.1
9/1/2016 0:03:08 54.1
9/1/2016 0:03:38 54.1
9/1/2016 0:04:38 54
9/1/2016 0:05:38 54
9/1/2016 0:06:08 54
9/1/2016 0:06:38 54
9/1/2016 0:07:08 54
9/1/2016 0:07:38 54
9/1/2016 0:08:08 54.1
9/1/2016 0:08:38 54.1
9/1/2016 0:09:38 54.1
9/1/2016 0:10:32 54
9/1/2016 0:11:02 54
9/1/2016 0:11:32 54
9/1/2016 0:00:08 54
9/2/2016 0:00:20 32
9/2/2016 0:00:50 32
9/2/2016 0:01:20 32
9/2/2016 0:01:50 32
9/2/2016 0:02:20 32
9/2/2016 0:02:50 32
9/2/2016 0:03:20 32
9/2/2016 0:03:50 32
9/2/2016 0:04:20 32
9/2/2016 0:04:50 32
9/2/2016 0:05:20 32
9/2/2016 0:05:50 32
9/2/2016 0:06:20 32
9/2/2016 0:06:50 32
9/2/2016 0:07:20 32
9/2/2016 0:07:50 32
Here is my code for calculating the per-day average, but I want it per hour:
from datetime import datetime
import pandas

def same_day(date_string):  # Drop the year and time, keeping month and day
    return datetime.strptime(date_string, "%m/%d/%Y %H:%M:%S").strftime('%m%d')

df = pandas.read_csv('/home/kk/Desktop/cal_Avg.csv', index_col=0, usecols=[0, 1], names=['Timestamp', 'Discharge'], converters={'Timestamp': same_day})
print(df.groupby(level=0).mean())
My desired output is like:
Timestamp Temp * Avg
9/1/2016 0:00:08 53.8
9/1/2016 0:00:38 53.8 ?avg for this hour
9/1/2016 0:01:08 53.8
9/1/2016 0:01:38 53.8 ?avg for this hour
9/1/2016 0:02:08 53.8
9/1/2016 0:02:38 54.1
Now I want the average for specific hours, Min
Desired output:
Here I am printing only 5 hours of output for the dates 01-09-2016 and 02-09-2016
010900 54.362727 45.497273
010901 54.723276 45.068103
010902 54.746847 45.370270
010903 54.833913 44.931304
010904 54.971053 44.835088
010905 55.519444 44.459259
020901 31.742553 55.640426
020902 31.495556 55.655556
020903 31.304348 55.442609
020904 31.200000 55.437273
020905 31.294382 55.442697
Specific dates and their specific hours? How do I achieve this?
Answer 1:
I think you first need read_csv with the parameters index_col=[0], to read the first column into the index, and parse_dates=[0], to parse the first column into a DatetimeIndex:
import pandas as pd

df = pd.read_csv('filename', index_col=[0], parse_dates=[0], usecols=[0,1])
print (df)
Temperature
Timestamp
2016-09-01 00:00:08 53.8
2016-09-01 00:00:38 53.8
2016-09-01 00:01:08 53.8
2016-09-01 00:01:38 53.8
2016-09-01 00:02:08 53.8
2016-09-01 00:02:38 54.1
2016-09-01 00:03:08 54.1
...
...
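For anyone who wants to try this without the original file, here is a minimal sketch of the same read_csv call. The in-memory csv_text is a made-up stand-in for the real file, and it assumes the columns are comma-separated, which the question does not show.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file (assumes comma-separated columns).
csv_text = """Timestamp,Temperature
9/1/2016 0:00:08,53.8
9/1/2016 0:00:38,53.8
9/1/2016 1:15:08,54.1
9/2/2016 0:00:20,32
"""

# index_col=[0] moves the first column into the index,
# parse_dates=[0] converts that column to datetimes.
df = pd.read_csv(io.StringIO(csv_text), index_col=[0], parse_dates=[0], usecols=[0, 1])

print(type(df.index).__name__)
print(df)
```

If the index does not come back as a DatetimeIndex, resample will refuse to run, so this is worth checking first.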
Then use resample by hours and aggregate with Resampler.mean; hours with no data in the DatetimeIndex come out as NaN:
print (df.resample('H').mean())
Temperature
Timestamp
2016-09-01 00:00:00 53.980952
2016-09-01 01:00:00 NaN
2016-09-01 02:00:00 NaN
2016-09-01 03:00:00 NaN
2016-09-01 04:00:00 NaN
2016-09-01 05:00:00 NaN
2016-09-01 06:00:00 NaN
2016-09-01 07:00:00 NaN
2016-09-01 08:00:00 NaN
2016-09-01 09:00:00 NaN
2016-09-01 10:00:00 NaN
2016-09-01 11:00:00 NaN
2016-09-01 12:00:00 NaN
2016-09-01 13:00:00 NaN
2016-09-01 14:00:00 NaN
2016-09-01 15:00:00 NaN
2016-09-01 16:00:00 NaN
2016-09-01 17:00:00 NaN
2016-09-01 18:00:00 NaN
2016-09-01 19:00:00 NaN
2016-09-01 20:00:00 NaN
2016-09-01 21:00:00 NaN
2016-09-01 22:00:00 NaN
2016-09-01 23:00:00 NaN
2016-09-02 00:00:00 32.000000
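If the empty hours are not wanted, one option is to drop them after resampling. A small sketch on made-up data (the values below are not from the question); it uses the lowercase 'h' alias, which recent pandas releases prefer over 'H'.

```python
import io
import pandas as pd

# Hypothetical sample data (assumes comma-separated columns).
csv_text = """Timestamp,Temperature
9/1/2016 0:00:08,53.8
9/1/2016 0:30:08,54.2
9/1/2016 2:10:00,55.0
9/2/2016 0:00:20,32
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=[0], parse_dates=[0])

# One row per hour between the first and last timestamp; empty hours are NaN.
hourly = df.resample('h').mean()

# Keep only the hours that actually contain readings.
non_empty = hourly.dropna()
print(non_empty)
```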
Another solution is to remove the minutes and seconds by casting to hours, then groupby by this array:
print (df.index.values.astype('<M8[h]'))
['2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
'2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
'2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
'2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
'2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
'2016-09-01T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
'2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
'2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
'2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
'2016-09-02T00']
print (df.groupby([df.index.values.astype('<M8[h]')]).mean())
Temperature
2016-09-01 53.980952
2016-09-02 32.000000
Also, if you need the mean by month, day and hour, it is possible to groupby by DatetimeIndex.strftime:
print (df.index.strftime('%m%d%H'))
['090100' '090100' '090100' '090100' '090100' '090100' '090100' '090100'
'090100' '090100' '090100' '090100' '090100' '090100' '090100' '090100'
'090100' '090100' '090100' '090100' '090100' '090200' '090200' '090200'
'090200' '090200' '090200' '090200' '090200' '090200' '090200' '090200'
'090200' '090200' '090200' '090200' '090200']
print (df.groupby([df.index.strftime('%m%d%H')]).mean())
Temperature
090100 53.980952
090200 32.000000
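The desired output in the question has keys like 010900, which looks like day, month, hour rather than month, day, hour. Assuming that reading is right, swapping the format codes should give those keys. A sketch with made-up values:

```python
import pandas as pd

# Hypothetical readings; only the timestamps matter for the key format.
idx = pd.to_datetime(['2016-09-01 00:00:08', '2016-09-01 00:30:08', '2016-09-02 00:00:20'])
df = pd.DataFrame({'Temperature': [53.8, 54.2, 32.0]}, index=idx)

# '%d%m%H' is day, month, hour, so 1 September at 00:xx becomes '010900'.
keys = df.index.strftime('%d%m%H')
result = df.groupby(keys).mean()
print(result)
```

Note that these keys are strings, so they sort alphabetically and not chronologically once more than one month is involved.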
Or, if you need the mean only by hour, groupby by DatetimeIndex.hour:
print (df.index.hour)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
print (df.groupby([df.index.hour]).mean())
Temperature
0 44.475676
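For "specific dates and their specific hours", one possible approach is to filter on the hour first and then group by both the date and the hour. This is only a sketch: the data and the wanted_hours list are made up for illustration.

```python
import pandas as pd

# Hypothetical readings spread over two days.
idx = pd.to_datetime([
    '2016-09-01 00:00:08', '2016-09-01 00:30:08',
    '2016-09-01 01:10:00', '2016-09-02 01:00:20', '2016-09-02 05:00:00',
])
df = pd.DataFrame({'Temperature': [53.8, 54.2, 55.0, 32.0, 31.0]}, index=idx)

# Keep only rows whose hour of day is in the chosen set (hypothetical choice).
wanted_hours = [0, 1]
subset = df[df.index.hour.isin(wanted_hours)]

# Average per calendar day and hour of day; the result has a two-level index.
per_day_hour = subset.groupby([subset.index.date, subset.index.hour]).mean()
print(per_day_hour)
```

Unlike grouping by DatetimeIndex.hour alone, this keeps hour 0 of one day separate from hour 0 of the next.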
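Answer 2:
I would first define a new column hour for readability, then groupby it:
import pandas as pd
df = pd.read_csv('/home/kk/Desktop/cal_Avg.csv')  # DataFrame.from_csv has been removed from pandas
# Drop the trailing ':MM:SS' (6 characters) so only the date and hour remain
df['hour'] = df['Timestamp'].apply(lambda s: s[:-6])
print(df[['hour', 'Temperature']].groupby('hour').mean())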
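A variant of the same idea that does not depend on the exact string layout is to parse the timestamps and floor them to the hour. A sketch, assuming the timestamps follow the month/day/year format shown in the question; the sample rows are made up.

```python
import io
import pandas as pd

# Hypothetical sample data (assumes comma-separated columns).
csv_text = """Timestamp,Temperature
9/1/2016 0:00:08,53.8
9/1/2016 0:30:08,54.2
9/1/2016 12:10:00,55.0
9/2/2016 0:00:20,32
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse the text timestamps, then round each one down to the start of its hour.
parsed = pd.to_datetime(df['Timestamp'], format='%m/%d/%Y %H:%M:%S')
df['hour'] = parsed.dt.floor('h')

hourly = df.groupby('hour')['Temperature'].mean()
print(hourly)
```

The hour column then holds real timestamps, so the groups sort chronologically.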
Source: https://stackoverflow.com/questions/40256020/python-calculate-average-for-each-hour-in-csv