问题
I want to add missing observations in my panel data set, but keep running into memory issues.
I use the following code (based on this topic):
library(dplyr)
group_by(df, group) %>%
complete(time = full_seq(time 1L)) %>%
mutate_each(funs(replace(., which(is.na(.)), 0)), -group, -time)
My data would look similar to the data in that topic, thus:
group time value
1 1 50
1 3 52
1 4 10
2 1 4
2 4 84
2 5 2
which I would like to look like
group time value
1 1 50
1 2 0
1 3 52
1 4 10
2 1 4
2 2 0
2 3 0
2 4 84
2 5 2
The problem is that I keep running out of memory (it is a only a 1 GB file with around 1.5 million observations). Any suggestions on how to do this differently?
回答1:
If memory is an issue, you might try it with the data.table-package:
Option 1:
# convert to a data.table
setDT(df)
# create a reference table
new.df <- df[, .(time = min(time):max(time)), group]
# join and replace the NA-values with zero's
new.df[df, value := i.value, on = names(new.df)][is.na(value), value := 0]
which gives:
> new.df
group time value
1: 1 1 50
2: 1 2 0
3: 1 3 52
4: 1 4 10
5: 2 1 4
6: 2 2 0
7: 2 3 0
8: 2 4 84
9: 2 5 2
Option 2:
setDT(df, key='time')[, .SD[J(min(time):max(time))], by = group
][is.na(value), value := 0][]
which gives:
group time value
1: 1 1 50
2: 1 2 0
3: 1 3 52
4: 1 4 10
5: 2 1 4
6: 2 2 0
7: 2 3 0
8: 2 4 84
9: 2 5 2
来源:https://stackoverflow.com/questions/36141557/missing-time-values-in-r-memory-issues