Question
Here is an example of my dataset. I want to calculate a binned average over time (ts) for every 10 seconds. Could you please provide some hints so that I can carry on?
In my case, I want to average both the time (ts) and Var over every 10 seconds. For example, I would get one averaged value of Var and ts for 0 to 10 seconds, another averaged value of Var and ts for 11 to 20 seconds, and so on.
df = data.frame(ts = seq(1,100,by=0.5), Var = runif(199,1, 10))
Are there any functions or libraries in R that I can use for this task?
Answer 1:
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr and data.table, probably with zoo, and surely with other time series packages...
library(dplyr)
# round() centres each bin on a multiple of 10 (ts of roughly 5 to 15 maps to 10)
df %>%
  group_by(interval = round(ts / 10) * 10) %>%
  summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
   interval Var_mean
      <dbl>    <dbl>
 1        0 4.561653
 2       10 6.544980
 3       20 6.110336
 4       30 4.288523
 5       40 5.339249
 6       50 6.811147
 7       60 6.180795
 8       70 4.920476
 9       80 5.486937
10       90 5.284871
11      100 5.917074
That's the dplyr approach. Note how it and data.table let you name the intermediate variables, which keeps the code clean and legible.
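For comparison, the base aggregate route mentioned at the top of this answer might look like the sketch below. It assumes bins of the form (0, 10], (10, 20], ... labelled by their upper edge, which is a choice made here and differs from the dplyr code above (that one rounds to the nearest multiple of 10). It averages both ts and Var, as the question asks. The column name bin and the set.seed(42) call are illustrative additions, not part of the original data.

```r
# Recreate the question's data; the seed is an added assumption for reproducibility
set.seed(42)
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

# Label each row with the upper edge of its 10-second bin:
# (0, 10] -> 10, (10, 20] -> 20, ...
df$bin <- 10 * ceiling(df$ts / 10)

# Average ts and Var within each bin
binned <- aggregate(cbind(ts, Var) ~ bin, data = df, FUN = mean)
print(binned)
```

The result has one row per bin, with the mean timestamp in ts and the mean value in Var.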
Answer 2:
In general, I agree with @smci: the dplyr and data.table approaches are the best here. Let me elaborate a bit further.
# the dplyr way
# (20 consecutive rows = 10 seconds, because ts advances in steps of 0.5)
library(dplyr)
df %>%
  group_by(interval = ceiling(seq_along(ts) / 20)) %>%
  summarize(variable_mean = mean(Var))
# the data.table way
library(data.table)
dt <- data.table(df)
dt[, list(Var_mean = mean(Var)),
   by = list(interval = ceiling(seq_along(ts) / 20))]
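Note that ceiling(seq_along(ts) / 20) groups rows by position: 20 consecutive rows make up 10 seconds only because the example data is sampled every 0.5 s without gaps. The following is a minimal base R sketch, assuming you would rather bin on the timestamps themselves (for instance with irregular sampling). The gap introduced below and the names df_gap, by_row, by_time and span are hypothetical.

```r
set.seed(1)  # added for reproducibility; not in the original question
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

# Hypothetical: drop the samples between 20 s and 40 s to simulate a gap
df_gap <- df[df$ts <= 20 | df$ts > 40, ]

# Position-based bins: every 20 rows, regardless of the actual timestamps
by_row <- ceiling(seq_along(df_gap$ts) / 20)

# Time-based bins: upper edge of the 10-second window each timestamp falls in
by_time <- 10 * ceiling(df_gap$ts / 10)

# Time span covered by each bin under the two schemes
span <- function(x) diff(range(x))
row_spans  <- tapply(df_gap$ts, by_row, span)   # one bin stretches across the gap
time_spans <- tapply(df_gap$ts, by_time, span)  # every bin stays within 10 seconds
print(row_spans)
print(time_spans)

# Time-based bin means of Var
print(tapply(df_gap$Var, by_time, mean))
```

With regularly sampled, gap-free data the two schemes can be made to agree; once samples are missing, only the time-based labels still correspond to 10-second windows.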
I would not go to the traditional time series solutions like ts, zoo or xts here. Their methods are better suited to regular frequencies such as monthly or quarterly data. Apart from ts, they can handle irregular frequencies and also high-frequency data, but many methods, such as the print methods, don't work well or at least do not give you an advantage over data.table or data.frame.
As long as you're just aggregating and grouping, both data.table and dplyr are also likely to be faster. I'd guess data.table has the edge over dplyr in terms of speed, but you would have to benchmark / profile that, e.g. using microbenchmark. So if you're not working with a classic R time series format anyway, there's no reason to go to these for aggregating.
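A benchmark along those lines could be sketched as follows. This is only an illustration: it assumes the dplyr, data.table and microbenchmark packages are installed (the code skips the comparison otherwise), the seed and times = 50 are arbitrary choices, and with only 199 rows the timings say little about behaviour on large data.

```r
set.seed(1)  # added for reproducibility; not in the original question
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

# Run the comparison only if all three packages are installed
pkgs <- c("dplyr", "data.table", "microbenchmark")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  dt <- data.table::as.data.table(df)
  bench <- microbenchmark::microbenchmark(
    dplyr = dplyr::summarise(
      dplyr::group_by(df, interval = ceiling(seq_along(ts) / 20)),
      Var_mean = mean(Var)
    ),
    data.table = dt[, list(Var_mean = mean(Var)),
                    by = list(interval = ceiling(seq_along(ts) / 20))],
    times = 50
  )
  print(bench)
}
```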
Answer 3:
Assuming df in the question, convert to a zoo object and then aggregate.
The second argument of aggregate.zoo is a vector the same length as the time vector giving the new times that each original time is to be mapped to. The third argument is a function that is applied to all time series values whose times have been mapped to the same value. This mapping could be done in various ways, but here we have chosen to map times (0, 10] to 10, (10, 20] to 20, etc. by using 10 * ceiling(time(z) / 10).
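To see what that mapping does at the bin edges, here is a small base R sketch. The sample times are made up for illustration, and the cut() comparison is an added cross-check rather than part of the zoo approach.

```r
# A few sample times, chosen to include bin edges (illustrative values)
times <- c(0.5, 1, 9.5, 10, 10.5, 20, 20.5, 99.5, 100)

# Map each time to the upper edge of its (lower, upper] 10-second bin
mapped <- 10 * ceiling(times / 10)
print(data.frame(time = times, bin = mapped))

# Cross-check: cut() with right-closed intervals yields the same bins
via_cut <- 10 * as.integer(cut(times, breaks = seq(0, 100, by = 10)))
stopifnot(all(via_cut == mapped))
```

A time of exactly 10 lands in the bin labelled 10, while 10.5 lands in the bin labelled 20.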
In light of some of the other comments in the answers, let me point out that, in contrast to using a data frame, there is significant simplification here. First, the data has been reduced to one dimension (vs. 2 in a data.frame). Second, it is more conducive to the whole-object approach, whereas with data frames one needs to continually pick apart the object and work on those parts. Third, one now has all the facilities of zoo to manipulate the time series: numerous NA removal schemes, rolling functions, overloaded arithmetic operators, n-way merges, simple access to classic, lattice and ggplot2 graphics, a design which emphasizes consistency with base R (making it easy to learn), and extensive documentation including 5 vignettes plus help files with numerous examples. There are also likely very few bugs given the 14 years of development and widespread use.
library(zoo)
z <- read.zoo(df)
z10 <- aggregate(z, 10 * ceiling(time(z) / 10), mean)
giving:
> z10
      10       20       30       40       50       60       70       80
5.629926 6.571754 5.519487 5.641534 5.309415 5.793066 4.890348 5.509859
      90      100
4.539044 5.480596
(Note that the data in the question is not reproducible because it used random numbers without set.seed, so if you try to repeat the above you won't get an identical answer.)
Now we could plot it, say, using any of these:
plot(z10)
library(lattice)
xyplot(z10)
library(ggplot2)
autoplot(z10)
Source: https://stackoverflow.com/questions/48837016/timeseries-average-based-on-a-defined-time-interval-bin