Question
I would like to subsample a data frame at hourly intervals from a datetime column, beginning with the time value in the first row of the data frame. My data frame runs at 10-minute intervals from the first to the last row. Example data is below:
df <- structure(list(datetime = structure(1:19, .Label = c("30/03/2011 05:09",
"30/03/2011 05:19", "30/03/2011 05:29", "30/03/2011 05:39", "30/03/2011 05:49",
"30/03/2011 05:59", "30/03/2011 06:09", "30/03/2011 06:19", "30/03/2011 06:29",
"30/03/2011 06:39", "30/03/2011 06:49", "30/03/2011 06:59", "30/03/2011 07:09",
"30/03/2011 07:19", "30/03/2011 07:29", "30/03/2011 07:39", "30/03/2011 07:49",
"30/03/2011 07:59", "30/03/2011 08:09"), class = "factor"), a_count = c(66L,
34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L, 17L, 38L, 24L, 19L,
60L, 54L, 27L, 36L, 45L), b_count = c(166.49, 167.54, 168.31,
168.81, 169.24, 169.61, 169.96, 170.29, 170.63, 170.98, 171.31,
171.62, 171.94, 172.29, 172.68, 173.15, 173.71, 174.34, 174.99
)), .Names = c("datetime", "a_count", "b_count"), class = "data.frame", row.names = c(NA,
-19L))
df
           datetime a_count b_count
1  30/03/2011 05:09      66  166.49
2  30/03/2011 05:19      34  167.54
3  30/03/2011 05:29      33  168.31
4  30/03/2011 05:39      20  168.81
5  30/03/2011 05:49      12  169.24
6  30/03/2011 05:59      44  169.61
7  30/03/2011 06:09      36  169.96
8  30/03/2011 06:19      29  170.29
9  30/03/2011 06:29      21  170.63
10 30/03/2011 06:39      22  170.98
11 30/03/2011 06:49      17  171.31
12 30/03/2011 06:59      38  171.62
13 30/03/2011 07:09      24  171.94
14 30/03/2011 07:19      19  172.29
15 30/03/2011 07:29      60  172.68
16 30/03/2011 07:39      54  173.15
17 30/03/2011 07:49      27  173.71
18 30/03/2011 07:59      36  174.34
19 30/03/2011 08:09      45  174.99
I would like to end up with the following data frame:
        datetime a_count b_count
30/03/2011 05:09      66  166.49
30/03/2011 06:09      36  169.96
30/03/2011 07:09      24  171.94
30/03/2011 08:09      45  174.99
Any suggestions would be greatly appreciated!
Answer 1:
It is hard to guess what structure you have. Is it guaranteed that you have one value at exactly the first time value plus a multiple of 60 minutes? What happens if that value cannot be found? What happens if you have two values at that time? Do you need approximate matching, so that, say, 09:10 is counted as 09:09?
One idea to get you started is the following:
# I will call your dataframe `d`.
# Transform datetime to a POSIXct object, R's datatype for timestamps
d$datetime <- as.POSIXct(as.character(d$datetime), format='%d/%m/%Y %H:%M')
# Extract the minutes
d$minute <- as.numeric(format(d$datetime, '%M'))
# And select by identical minute.
subset(d, minute == d$minute[1])
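The questions above matter as soon as the timestamps are not perfectly regular, because matching on the identical minute would then skip rows. As a minimal sketch, assuming each hourly target should take the nearest row within a tolerance, the helper below picks one row per hour starting from the first timestamp. The names `nearest_hourly` and `tol_mins` are hypothetical and not part of the answer; the example data is rebuilt here with one timestamp deliberately shifted by a minute to mimic the 09:10 versus 09:09 case.

```r
# Example data: 19 timestamps at 10-minute steps, as in the question.
d <- data.frame(
  datetime = seq(as.POSIXct("2011-03-30 05:09", tz = "UTC"),
                 by = "10 min", length.out = 19),
  a_count = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
              17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)
# Assumption for illustration: row 7 was logged one minute late (06:10).
d$datetime[7] <- d$datetime[7] + 60

# Hypothetical helper: for each hourly target, return the index of the
# nearest row, or NA if no row lies within `tol_mins` minutes of it.
nearest_hourly <- function(times, tol_mins = 2) {
  targets <- seq(times[1], max(times), by = "hour")
  vapply(seq_along(targets), function(k) {
    gap <- abs(as.numeric(difftime(times, targets[k], units = "mins")))
    i <- which.min(gap)  # ties resolve to the earliest row
    if (gap[i] <= tol_mins) i else NA_integer_
  }, integer(1))
}

idx <- nearest_hourly(d$datetime)
hourly <- d[idx[!is.na(idx)], ]
```

With this data the late row is still selected for the 06:09 target, whereas a comparison on the identical minute would have dropped it. Targets with no row inside the tolerance are simply left out; whether that is the right behaviour depends on the answers to the questions above.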
Answer 2:
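> df$datetime <- as.POSIXct(as.character(df$datetime), format = "%d/%m/%Y %H:%M")
> # Minutes elapsed since the first row; explicit units stop difftime from choosing hours or seconds
> df$dif <- as.numeric(difftime(df$datetime, df$datetime[1], units = "mins"))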
>
> df[df$dif %% 60 == 0,]
              datetime a_count b_count dif
1  2011-03-30 05:09:00      66  166.49   0
7  2011-03-30 06:09:00      36  169.96  60
13 2011-03-30 07:09:00      24  171.94 120
19 2011-03-30 08:09:00      45  174.99 180
I have the same questions as Thilo, but here's another solution.
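A related sketch, assuming the timestamps fall exactly on the hourly grid: build the grid with `seq()` from the first timestamp and keep the rows that land on it. The variable names `grid`, `by_grid` and `by_row` are illustrative only. The last line shows a shortcut that is valid only if the spacing is a strict 10 minutes with no gaps, as the question describes.

```r
# Example data: 19 timestamps at 10-minute steps, as in the question.
df <- data.frame(
  datetime = seq(as.POSIXct("2011-03-30 05:09", tz = "UTC"),
                 by = "10 min", length.out = 19),
  a_count = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
              17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Hourly grid starting at the first timestamp.
grid <- seq(df$datetime[1], max(df$datetime), by = "hour")

# Keep only the rows whose timestamp lies exactly on the grid.
by_grid <- df[df$datetime %in% grid, ]

# Shortcut that relies on perfectly regular 10-minute spacing: every 6th row.
by_row <- df[seq(1, nrow(df), by = 6), ]
```

Both selections return rows 1, 7, 13 and 19 for this data. The grid version keeps working if rows are missing elsewhere in the series, while the every-6th-row version silently drifts in that case.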
Answer 3:
You can also use the lubridate package to parse your times, which may be a bit more intuitive and easier to remember.
You can then add variables based on the hour and summarize however you like with plyr.
In the example below I took the sum and mean of a_count. You may need to vary this based on your purpose.
library(plyr)
library(lubridate)
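# Parse the day/month/year timestamps, then pull out the hour and minute
df2 <- mutate(df, dt = dmy_hm(as.character(datetime)), hour = hour(dt), minute = minute(dt))
# One row per clock hour; named hourly_summary to avoid masking base::summary
hourly_summary <- ddply(df2, .(hour), summarize, a_mean = mean(a_count), a_sum = sum(a_count))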
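For readers without plyr or lubridate, the same per-hour summary can be sketched in base R. This is only an illustration under the assumption that grouping by clock hour is what is wanted; the names `d` and `hourly_summary` are chosen here and are not from the answer. Note that this aggregates the rows within each hour rather than picking one row per hour.

```r
# Example data: 19 timestamps at 10-minute steps, as in the question.
d <- data.frame(
  datetime = seq(as.POSIXct("2011-03-30 05:09", tz = "UTC"),
                 by = "10 min", length.out = 19),
  a_count = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
              17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Clock hour of each row (05:09 to 05:59 all fall in hour 5, and so on).
d$hour <- as.integer(format(d$datetime, "%H"))

# Per-hour sum and mean of a_count using tapply().
a_sum <- tapply(d$a_count, d$hour, sum)
a_mean <- tapply(d$a_count, d$hour, mean)

hourly_summary <- data.frame(
  hour = as.integer(names(a_sum)),
  a_mean = as.vector(a_mean),
  a_sum = as.vector(a_sum)
)
```

The result has one row for each of the hours 5 to 8; the last hour contains only the single 08:09 reading.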
Source: https://stackoverflow.com/questions/19668391/how-to-subsample-a-data-frame-based-on-a-datetime-column-in-r