Question
I am trying to convert an irregular time series in a data.table into a regular time series. My data looks like this:
library(data.table)
dtRes <- data.table(time = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
abst = c(1, 1, 1, 0, 0, 3, 3),
farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
gier = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
goff = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
huft = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
mode = c(4, 4, 4, 2.5, NA, 3, 3))
How can I aggregate the observations in chunks of, say, 1 second? Each chunk can hold a variable number of rows, even 0 if no observation falls within a given 1-second period. The result should be the mean for the numeric columns (NAs omitted); for the factor columns, the whole row should be duplicated if there is more than one unique value. If that is not possible for factors, or doesn't make sense to you, it is also fine to just take the first value of the specific second in the factor column. That way it would be a real regular time series without any duplicated times. If there is no value for an interval (like the 2nd second in the example), the result is NA.
In the end the result can look like this (depends on duplicated rows or not):
with duplicates:
wiDups <- data.table(time = c(1, 1, 2, 3, 4, 5, 5),
abst = c(1, 1, NA, 1, NA, 5, 5),
farbe = as.factor(c("keine", "keine", NA, "keine", NA, "rot", "blau")),
gier = c(2.5, 2.5, NA, 0, NA, 4.5, 4.5),
goff = as.factor(c("haus", "maus", NA, "maus", NA, "maus", "maus")),
huft = as.factor(c(NA, NA, NA, "wolle", NA, "holz", "holz")),
mode = c(5, 5, NA, 2.5, NA, 4, 4))
and without duplicates:
noDups <- data.table(time = c(1, 2, 3, 4, 5),
abst = c(1, NA, 1, NA, 5),
farbe = as.factor(c("keine", NA, "keine", NA, "rot")),
gier = c(2.5, NA, 0, NA, 4.5),
goff = as.factor(c("haus", NA, "maus", NA, "maus")),
huft = as.factor(c(NA, NA, "wolle", NA, "holz")),
mode = c(5, NA, 2.5, NA, 4))
Is it better to convert it into a time series object?
Answer 1:
The question is a follow-up to the OP's question "R combining duplicate rows in a time series with different column types in a datatable".
The OP has requested to turn an irregular time series into a regular time series by aggregating.
This can be achieved by aggregating and filling in missing values by a right join.
win <- 1 # step size of time series or length of time window
brk <- dtRes[, .(time = tail(scales::fullseq(range(time), win), -1L))]
dtRes[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE)
else unlist(na.omit(unique(x)))),
by = .(time = ceiling(time / win) * win)][
brk, on = .(time)]
   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2   NA  <NA>   NA <NA>  <NA>   NA
5:    3    0 keine  0.0 maus wolle  2.5
6:    4   NA  <NA>   NA <NA>  <NA>   NA
7:    5    3   rot  3.0 maus  holz  3.0
8:    5    3  blau  3.0 maus  holz  3.0
For creating the regular time series, fullseq() from the scales package is used for convenience. Apparently, the OP prefers right-closed intervals, so the first value can be skipped.
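As a minimal sketch of the binning step, assuming the same time values as in dtRes, the base R snippet below shows how ceiling(time / win) * win assigns each observation to the upper edge of its window and which windows stay empty. The names bin, grid and empty are illustrative and not part of the answer, and seq() stands in here for scales::fullseq().

```r
# Observation times taken from dtRes in the question
time <- c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9)
win <- 1  # window length in seconds

# Right-closed windows: each time is mapped to the upper edge of its window
bin <- ceiling(time / win) * win

# Regular grid of window edges (base R stand-in for scales::fullseq())
grid <- seq(min(bin), max(bin), by = win)

# Windows without any observation; the right join fills these with NA
empty <- setdiff(grid, bin)

print(bin)
print(grid)
print(empty)
```

For this data, the empty windows are the ones ending at 2 and 4, which matches the all-NA rows in the result above.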
The warning message

In `[.data.table`(dtRes, , lapply(.SD, function(x) if (is.numeric(x)) mean(x, :
  Item 5 of j's result for group 1 is zero length. This will be filled with 3 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.

can be happily ignored.
The warning is issued because huft is NA for time == 1. After calling na.omit(), the result vector for the huft column is empty, but the group result has 3 rows. So data.table fills the result vector with NA to get a matching length, which is what we expect.
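If the warning is a nuisance, one option is to return a single NA instead of an empty vector when a group has no non-missing factor value. The helper below is a sketch under that assumption; agg_col is a hypothetical name, not part of the answer. It is demonstrated on plain vectors only. Passed to lapply(.SD, agg_col) in place of the anonymous function, the single NA should then be recycled to the group length like any other length-1 result.

```r
# Hypothetical replacement for the anonymous aggregation function
agg_col <- function(x) {
  if (is.numeric(x)) {
    return(mean(x, na.rm = TRUE))
  }
  u <- unique(x[!is.na(x)])
  # Nothing left after dropping NAs: return one NA of the same type
  if (length(u) == 0L) x[NA_integer_] else u
}

# Columns copied from dtRes; rows 1 to 3 fall into the first second
huft <- factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz"))
goff <- factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus"))

print(agg_col(huft[1:3]))    # one NA instead of a zero-length factor
print(agg_col(goff[1:3]))    # three distinct values
print(agg_col(c(2.5, NA)))   # mean with NA omitted
```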
The solution is parameterised to work with different chunk sizes win. For a chunk size of win <- 0.5, for example, we get
    time abst farbe gier goff  huft mode
 1:  0.5    1 keine  2.5 haus  <NA>  4.0
 2:  1.0    1 keine  2.5 maus  <NA>  4.0
 3:  1.0    1 keine  2.5 toll  <NA>  4.0
 4:  1.5   NA  <NA>   NA <NA>  <NA>   NA
 5:  2.0   NA  <NA>   NA <NA>  <NA>   NA
 6:  2.5    0 keine  0.0 maus wolle  2.5
 7:  3.0   NA  <NA>   NA <NA>  <NA>   NA
 8:  3.5   NA  <NA>   NA <NA>  <NA>   NA
 9:  4.0   NA  <NA>   NA <NA>  <NA>   NA
10:  4.5   NA  <NA>   NA <NA>  <NA>   NA
11:  5.0    3   rot  3.0 maus  holz  3.0
12:  5.0    3  blau  3.0 maus  holz  3.0
with a lot more rows to fill in.
For a chunk size of win <- 2 we get
   time abst farbe gier goff  huft mode
1:    2    1 keine  2.5 haus  <NA>  4.0
2:    2    1 keine  2.5 maus  <NA>  4.0
3:    2    1 keine  2.5 toll  <NA>  4.0
4:    4    0 keine  0.0 maus wolle  2.5
5:    6    3   rot  3.0 maus  holz  3.0
6:    6    3  blau  3.0 maus  holz  3.0
A time series with multiple rows per time interval is not a regular time series, IMHO. With a slight modification, we can get
win <- 1
brk <- dtRes[, .(time = scales::fullseq(range(time), win)[-1L])]
dtRes[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE)
else list(na.omit(unique(x)))),
by = .(time = ceiling(time / win) * win)][
brk, on = .(time)]
   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2   NA            NA                        NA
3:    3    0    keine  0.0           maus wolle  2.5
4:    4   NA            NA                        NA
5:    5    3 rot,blau  3.0           maus  holz  3.0
Now, there is only one row per timestep because multiple factor values have been aggregated in a list element.
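List columns can be awkward to export or compare. As a small sketch, the snippet below collapses such a list of factor vectors into one comma-separated string per time step. The goff_list object is a hand-built stand-in for the goff column of the result above (empty windows are assumed to appear as zero-length or NULL entries), and collapse_values is a hypothetical helper.

```r
# Hand-built stand-in for the goff list column, one element per time step
goff_list <- list(
  factor(c("haus", "maus", "toll")),
  factor(character(0)),  # empty window as a zero-length factor
  factor("maus"),
  NULL,                  # empty window as NULL
  factor("maus")
)

# Collapse one list element to a single string; empty elements become NA
collapse_values <- function(v) {
  if (length(v) == 0L) NA_character_ else paste(as.character(v), collapse = ",")
}

goff_chr <- vapply(goff_list, collapse_values, character(1))

print(goff_chr)
print(lengths(goff_list))  # number of distinct values per time step
```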
Answer 2:
Here's a data.table answer:
Rounding time to the nearest second:
> dtRes[,
+ lapply(.SD, function(z) {return(ifelse(is.factor(z), levels(z)[unique(z)[1]], mean(z, na.rm = T)))} ),
+ by = .(time = round(time, digits = 0))]
time abst farbe gier goff huft mode
1: 0 1 keine 2.5 haus <NA> 4.0
2: 1 1 keine 2.5 maus <NA> 4.0
3: 2 0 keine 0.0 maus wolle 2.5
4: 5 3 rot 3.0 maus holz 3.0
Using the ceiling function:
> dtRes[,
+ lapply(.SD, function(z) {return(ifelse(is.factor(z), levels(z)[unique(z)[1]], mean(z, na.rm = T)))} ),
+ by = .(time = ceiling(time))]
time abst farbe gier goff huft mode
1: 1 1 keine 2.5 haus <NA> 4.0
2: 3 0 keine 0.0 maus wolle 2.5
3: 5 3 rot 3.0 maus holz 3.0
You can adjust the logic for returning the level based on what you want. Here I'm returning the level corresponding to the first unique value.
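Because is.factor(z) is a single TRUE or FALSE, ifelse() here returns exactly one value per column and group. A plain if/else expresses the same choice more directly; the sketch below is one such variant, with first_or_mean as a hypothetical name. It returns the first unique value of a factor as character (which is NA if the first value is missing) and the mean otherwise.

```r
# Hypothetical helper: first unique value for factors, mean for everything else
first_or_mean <- function(z) {
  if (is.factor(z)) {
    as.character(unique(z)[1])
  } else {
    mean(z, na.rm = TRUE)
  }
}

# Columns copied from dtRes in the question
goff <- factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus"))
mode <- c(4, 4, 4, 2.5, NA, 3, 3)

print(first_or_mean(goff[1:3]))  # "haus"
print(first_or_mean(mode[4:5]))  # 2.5
```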
You may switch to using as.numeric in the ifelse statement, with similar results. I realized the data type for factor columns changes to character; if you need factor, you can set this explicitly in a separate statement or by chaining.
dtRes[, lapply(.SD, ....), by = .(....)][, lapply(.SD, as.factor), .SDcols = c(....)]  # c(....): names of the columns you want as factors
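One way to do the conversion back to factor, sketched here on a small stand-in table rather than the real result, is to update the chosen columns by reference with :=. The table res and the vector fct_cols are illustrative names; adjust fct_cols to the columns you want as factors.

```r
library(data.table)

# Stand-in for an aggregated result whose factor columns became character
res <- data.table(time  = c(1, 3, 5),
                  farbe = c("keine", "keine", "rot"),
                  goff  = c("haus", "maus", "maus"),
                  gier  = c(2.5, 0, 3))

fct_cols <- c("farbe", "goff")  # columns to turn back into factors

# Update in place; all other columns are left untouched
res[, (fct_cols) := lapply(.SD, as.factor), .SDcols = fct_cols]

str(res)
```

Unlike a chained lapply(.SD, ...) with .SDcols, which returns only the selected columns, the := form keeps the remaining columns in the table.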
Answer 3:
Not entirely sure what you want to do, but if I understood correctly something like this:
library(dplyr)

dtRes %>%
  group_by(second = ceiling(time)) %>%
  summarise(abst = mean(abst),
            farbe = farbe[1],
            gier = mean(gier),
            goff = goff[1],
            huft = huft[1],
            mode = mean(mode)) %>%
  add_row(second = c(1:10)[!(c(1:10) %in% .$second)]) %>% # change 10 to however many seconds you will have
  arrange(second)
# A tibble: 10 x 7
second abst farbe gier goff huft mode
<dbl> <dbl> <fct> <dbl> <fct> <fct> <dbl>
1 1 1 keine 2.5 haus NA 4
2 2 NA NA NA NA NA NA
3 3 0 keine 0 maus wolle NA
4 4 NA NA NA NA NA NA
5 5 3 rot 3 maus holz 3
6 6 NA NA NA NA NA NA
7 7 NA NA NA NA NA NA
8 8 NA NA NA NA NA NA
9 9 NA NA NA NA NA NA
10 10 NA NA NA NA NA NA
Note that from your example it is not clear how you round the time to seconds, but I think you always want to round up?
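Instead of hard-coding the number of seconds, the missing rows can also be derived from the data. The sketch below uses base R merge() on a small stand-in for the summarised result; agg, grid and res are illustrative names, and only one value column is shown.

```r
# Stand-in for the summarised result: seconds 2 and 4 have no observations
agg <- data.frame(second = c(1, 3, 5), abst = c(1, 0, 3))

# Regular grid spanning the observed range of seconds
grid <- data.frame(second = seq(min(agg$second), max(agg$second), by = 1))

# Left join onto the grid: seconds without data get NA in the value columns
res <- merge(grid, agg, by = "second", all.x = TRUE)

print(res)
```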
Source: https://stackoverflow.com/questions/62153998/convert-a-irregular-time-series-of-a-data-table-with-factors-into-a-regular-time