Missing time values in R - memory issues

问题

I want to add missing observations in my panel data set, but keep running into memory issues.

I use the following code (based on this topic):

library(dplyr)

group_by(df, group) %>% 
  complete(time = full_seq(time 1L)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)), -group, -time)

My data would look similar to the data in that topic, thus:

group   time           value
1           1          50
1           3          52
1           4          10
2           1          4
2           4          84
2           5          2

which I would like to look like

group  time        value
1           1       50
1           2       0
1           3       52
1           4       10
2           1       4
2           2       0
2           3       0
2           4       84
2           5       2

The problem is that I keep running out of memory (it is a only a 1 GB file with around 1.5 million observations). Any suggestions on how to do this differently?

回答1:

If memory is an issue, you might try it with the data.table-package:

Option 1:

# convert to a data.table
setDT(df)
# create a reference table
new.df <- df[, .(time = min(time):max(time)), group]
# join and replace the NA-values with zero's
new.df[df, value := i.value, on = names(new.df)][is.na(value), value := 0]

which gives:

> new.df
   group time value
1:     1    1    50
2:     1    2     0
3:     1    3    52
4:     1    4    10
5:     2    1     4
6:     2    2     0
7:     2    3     0
8:     2    4    84
9:     2    5     2

Option 2:

setDT(df, key='time')[, .SD[J(min(time):max(time))], by = group
                      ][is.na(value), value := 0][]

which gives:

   group time value
1:     1    1    50
2:     1    2     0
3:     1    3    52
4:     1    4    10
5:     2    1     4
6:     2    2     0
7:     2    3     0
8:     2    4    84
9:     2    5     2

来源：https://stackoverflow.com/questions/36141557/missing-time-values-in-r-memory-issues

标签

memory

out-of-memory

time-series

dplyr