Convert an irregular time series of a data.table with factors into a regular time series in R

Question

I am trying to convert an irregular time series in a data.table into a regular time series. My data looks like this:

library(data.table)
dtRes <- data.table(time  = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
                    abst  = c(1, 1, 1, 0, 0, 3, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                    gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                    goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                    huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                    mode  = c(4, 4, 4, 2.5, NA, 3, 3))

How can I aggregate the observations in chunks of, say, 1 second? Each chunk may contain a variable number of rows, possibly none. The result should be the mean for the numeric columns (NAs omitted). For the factor columns, the whole row should be duplicated if a chunk contains more than one unique value. If that is not possible for factors, or does not make sense to you, it is also fine to take just the first value of the factor column within that second; the result would then be a truly regular time series without any duplicated times. If an interval contains no values (like the 2nd second in the example), the result should be NA.

In the end the result can look like this (depends on duplicated rows or not):

with duplicates:

wiDups <- data.table(time  = c(1, 1, 2, 3, 4, 5, 5),
                     abst  = c(1, 1, NA, 1, NA, 5, 5),
                     farbe = as.factor(c("keine", "keine", NA, "keine", NA, "rot", "blau")),
                     gier  = c(2.5, 2.5, NA, 0, NA, 4.5, 4.5),
                     goff  = as.factor(c("haus", "maus", NA, "maus", NA, "maus", "maus")),
                     huft  = as.factor(c(NA, NA, NA, "wolle", NA, "holz", "holz")),
                     mode  = c(5, 5, NA, 2.5, NA, 4, 4))

and without duplicates:

noDups <- data.table(time  = c(1, 2, 3, 4, 5),
                     abst  = c(1, NA, 1, NA, 5),
                     farbe = as.factor(c("keine", NA, "keine", NA, "rot")),
                     gier  = c(2.5, NA, 0, NA, 4.5),
                     goff  = as.factor(c("haus", NA, "maus", NA, "maus")),
                     huft  = as.factor(c(NA, NA, "wolle", NA, "holz")),
                     mode  = c(5, NA, 2.5, NA, 4))

Is it better to convert it into a time series object?

Answer 1:

The question is a follow-up to the OP's earlier question "R combining duplicate rows in a time series with different column types in a datatable".

The OP has requested to turn an irregular time series into a regular time series by aggregating.

This can be achieved by aggregating and filling in missing values by a right join.

win <- 1   # step size of time series or length of time window
brk <- dtRes[, .(time = tail(scales::fullseq(range(time), win), -1L))]
dtRes[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
               else unlist(na.omit(unique(x)))), 
      by = .(time = ceiling(time / win) * win)][
        brk, on = .(time)]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2   NA  <NA>   NA <NA>  <NA>   NA
5:    3    0 keine  0.0 maus wolle  2.5
6:    4   NA  <NA>   NA <NA>  <NA>   NA
7:    5    3   rot  3.0 maus  holz  3.0
8:    5    3  blau  3.0 maus  holz  3.0

To create the regular time series, fullseq() from the scales package is used for convenience. Apparently, the OP prefers right-closed intervals, each labelled by its upper bound, so the first value of the sequence can be skipped.
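To see how the two building blocks fit together, the following minimal sketch computes the window label of each observation and the regular grid in base R. The vector `tm` just repeats the `time` column from the question, and the `seq()` call is an assumed stand-in for `tail(scales::fullseq(range(tm), win), -1L)`, not the code used in the answer.

```r
tm  <- c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9)  # time column from the question
win <- 1                                    # window length

# Label each observation with the upper bound of its right-closed window
bin <- ceiling(tm / win) * win
print(bin)
# 1 1 1 3 3 5 5

# Regular grid of upper bounds (assumed base R equivalent of the fullseq() call)
grid <- seq(floor(min(tm) / win) * win + win,
            ceiling(max(tm) / win) * win,
            by = win)
print(grid)
# 1 2 3 4 5

# Windows without any observation; the join fills these rows with NA
print(setdiff(grid, bin))
# 2 4
```

The empty windows 2 and 4 are exactly the all-NA rows in the result above.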

The warning message

In `[.data.table`(dtRes, , lapply(.SD, function(x) if (is.numeric(x)) mean(x, :
Item 5 of j's result for group 1 is zero length. This will be filled with 3 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.

can safely be ignored.

The warning is issued because huft is NA for every row in the group time == 1. After calling na.omit(), the result vector for the huft column is empty, but the group result has 3 rows. data.table therefore fills the result vector with NA to get a matching length, which is what we expect.
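If the warning is a nuisance, one possible variation is to return an explicit NA whenever a non-numeric column has no non-missing value in a window. The helper name `agg_col` is made up for this sketch, and the grid is built with base R `seq()` from the first to the last occupied window instead of `scales::fullseq()`; both are assumptions, not part of the answer above.

```r
library(data.table)

dtRes <- data.table(time  = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
                    abst  = c(1, 1, 1, 0, 0, 3, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                    gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                    goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                    huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                    mode  = c(4, 4, 4, 2.5, NA, 3, 3))
win <- 1

# Hypothetical helper: mean for numeric columns, unique non-NA values otherwise
agg_col <- function(x) {
  if (is.numeric(x)) return(mean(x, na.rm = TRUE))
  u <- unique(x[!is.na(x)])
  # A length-1 NA of the same type is recycled by data.table without a warning
  if (length(u) == 0L) x[NA_integer_] else u
}

agg <- dtRes[, lapply(.SD, agg_col), by = .(time = ceiling(time / win) * win)]

# Regular grid from the first to the last occupied window (assumption)
brk <- data.table(time = seq(min(agg$time), max(agg$time), by = win))
res <- agg[brk, on = .(time)]
print(res)
```

The design choice here is only about silencing the warning; the rows produced should be the same as in the original solution.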

The solution is parameterised to work with different chunk sizes win. For a chunk size of win <- 0.5, e.g., we get

    time abst farbe gier goff  huft mode
 1:  0.5    1 keine  2.5 haus  <NA>  4.0
 2:  1.0    1 keine  2.5 maus  <NA>  4.0
 3:  1.0    1 keine  2.5 toll  <NA>  4.0
 4:  1.5   NA  <NA>   NA <NA>  <NA>   NA
 5:  2.0   NA  <NA>   NA <NA>  <NA>   NA
 6:  2.5    0 keine  0.0 maus wolle  2.5
 7:  3.0   NA  <NA>   NA <NA>  <NA>   NA
 8:  3.5   NA  <NA>   NA <NA>  <NA>   NA
 9:  4.0   NA  <NA>   NA <NA>  <NA>   NA
10:  4.5   NA  <NA>   NA <NA>  <NA>   NA
11:  5.0    3   rot  3.0 maus  holz  3.0
12:  5.0    3  blau  3.0 maus  holz  3.0

with a lot more rows to fill in.

For a chunk size of win <- 2 we get

   time abst farbe gier goff  huft mode
1:    2    1 keine  2.5 haus  <NA>  4.0
2:    2    1 keine  2.5 maus  <NA>  4.0
3:    2    1 keine  2.5 toll  <NA>  4.0
4:    4    0 keine  0.0 maus wolle  2.5
5:    6    3   rot  3.0 maus  holz  3.0
6:    6    3  blau  3.0 maus  holz  3.0

A time series with multiple rows per time interval is not a regular time series, IMHO. With a slight modification, we can get

win <- 1
brk <- dtRes[, .(time = scales::fullseq(range(time), win)[-1L])]
dtRes[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
               else list(na.omit(unique(x)))), 
      by = .(time = ceiling(time / win) * win)][
        brk, on = .(time)]

   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2   NA            NA                        NA
3:    3    0    keine  0.0           maus wolle  2.5
4:    4   NA            NA                        NA
5:    5    3 rot,blau  3.0           maus  holz  3.0

Now, there is only one row per timestep because multiple factor values have been aggregated in a list element.
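If list columns are inconvenient downstream, another way to get one row per time step is to keep only the first non-missing factor value in each window, which the question explicitly allows. This is a sketch under that assumption; the helper name `first_or_mean` is hypothetical, and the grid is again built with base R `seq()` rather than `scales::fullseq()`.

```r
library(data.table)

dtRes <- data.table(time  = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
                    abst  = c(1, 1, 1, 0, 0, 3, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                    gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                    goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                    huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                    mode  = c(4, 4, 4, 2.5, NA, 3, 3))
win <- 1

# Hypothetical helper: mean for numeric columns, first non-NA value otherwise
# (indexing an empty vector with 1L yields NA, so all-NA windows stay NA)
first_or_mean <- function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]
}

agg <- dtRes[, lapply(.SD, first_or_mean),
             by = .(time = ceiling(time / win) * win)]

# Fill the empty windows with NA rows via a join on the regular grid
brk <- data.table(time = seq(min(agg$time), max(agg$time), by = win))
noDupsRes <- agg[brk, on = .(time)]
print(noDupsRes)
```

Every column stays atomic here, at the cost of discarding the additional factor values (for example "maus" and "toll" in the first window).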

Answer 2:

Here's a data.table answer:

Rounding time to the nearest second:

> dtRes[, 
+       lapply(.SD, function(z) {return(ifelse(is.factor(z), levels(z)[unique(z)[1]], mean(z, na.rm = T)))} ), 
+       by = .(time = round(time, digits = 0))]
   time abst farbe gier goff  huft mode
1:    0    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    2    0 keine  0.0 maus wolle  2.5
4:    5    3   rot  3.0 maus  holz  3.0

Using the ceiling function:

> dtRes[, 
+       lapply(.SD, function(z) {return(ifelse(is.factor(z), levels(z)[unique(z)[1]], mean(z, na.rm = T)))} ), 
+       by = .(time = ceiling(time))]
   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    3    0 keine  0.0 maus wolle  2.5
3:    5    3   rot  3.0 maus  holz  3.0

You can adjust the logic for returning the level based on what you want. Here I'm returning the level corresponding to the first unique value in each group.

You may switch to using as.numeric in the ifelse statement, with similar results. Note that the data type of the factor columns changes to character. If you need factors, you can convert them back in a separate statement or by chaining:

dtRes[, lapply(.SD, ....), by = .(....)][, (cols) := lapply(.SD, as.factor), .SDcols = cols]   # cols: character vector of the columns you want as factors
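A concrete version of this conversion might look as follows. It is a sketch, not the answerer's exact code: the aggregation uses a plain `if`/`else` that takes the first value of each factor column, and the names `facCols` and `res` are introduced here for illustration.

```r
library(data.table)

dtRes <- data.table(time  = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
                    abst  = c(1, 1, 1, 0, 0, 3, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                    gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                    goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                    huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                    mode  = c(4, 4, 4, 2.5, NA, 3, 3))

# Remember which columns were factors before aggregating
facCols <- names(dtRes)[vapply(dtRes, is.factor, logical(1))]

# Aggregate: first value (as character) for factors, mean for numeric columns
res <- dtRes[, lapply(.SD, function(z) {
                 if (is.factor(z)) as.character(z[1L]) else mean(z, na.rm = TRUE)
               }),
             by = .(time = ceiling(time))]

# Convert the character columns back to factor by reference
res[, (facCols) := lapply(.SD, as.factor), .SDcols = facCols]
print(vapply(res, function(col) class(col)[1L], character(1)))
```

Using `:=` keeps the numeric columns in the result; a plain `lapply(.SD, as.factor)` with `.SDcols` would return only the selected columns.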

Answer 3:

I'm not entirely sure what you want to do, but if I understood correctly, something like this:

library(dplyr)
library(tibble)   # add_row()

dtRes %>% 
  group_by(second = ceiling(time)) %>% 
  summarise(abst = mean(abst, na.rm = TRUE),   # na.rm = TRUE: the question asks for NAs to be omitted
            farbe = farbe[1],
            gier = mean(gier, na.rm = TRUE),
            goff = goff[1],
            huft = huft[1],
            mode = mean(mode, na.rm = TRUE)) %>% 
  add_row(second = c(1:10)[!(c(1:10) %in% .$second)]) %>% # change 10 to however many seconds you will have
  arrange(second)

# A tibble: 10 x 7
   second  abst farbe  gier goff  huft   mode
    <dbl> <dbl> <fct> <dbl> <fct> <fct> <dbl>
 1      1     1 keine   2.5 haus  NA      4
 2      2    NA NA     NA   NA    NA     NA
 3      3     0 keine   0   maus  wolle   2.5
 4      4    NA NA     NA   NA    NA     NA
 5      5     3 rot     3   maus  holz    3
 6      6    NA NA     NA   NA    NA     NA
 7      7    NA NA     NA   NA    NA     NA
 8      8    NA NA     NA   NA    NA     NA
 9      9    NA NA     NA   NA    NA     NA
10     10    NA NA     NA   NA    NA     NA

Note that from your example it is not clear how you round the time to seconds, but I think you always want to round up?
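To avoid hard-coding the number of seconds, the missing seconds could instead be filled in with `complete()` and `full_seq()` from the tidyr package. This is a sketch that assumes tidyr is available and that rounding up with `ceiling()` is what is wanted; the data frame `df` repeats the data from the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(time  = c(0.1, 0.8, 1, 2.3, 2.4, 4.8, 4.9),
                 abst  = c(1, 1, 1, 0, 0, 3, 3),
                 farbe = factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                 gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                 goff  = factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                 huft  = factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                 mode  = c(4, 4, 4, 2.5, NA, 3, 3))

res <- df %>%
  group_by(second = ceiling(time)) %>%
  summarise(abst  = mean(abst, na.rm = TRUE),
            farbe = farbe[1],
            gier  = mean(gier, na.rm = TRUE),
            goff  = goff[1],
            huft  = huft[1],
            mode  = mean(mode, na.rm = TRUE)) %>%
  # Add one all-NA row for every second between the first and last observed one
  complete(second = full_seq(second, 1)) %>%
  arrange(second)

print(res)
```

With the example data this should give one row per second from 1 to 5, with seconds 2 and 4 filled with NA.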

Source: https://stackoverflow.com/questions/62153998/convert-a-irregular-time-series-of-a-data-table-with-factors-into-a-regular-time

Tags

datatable

time-series

xts

zoo