How to create missing value for repeated measurement data?

和自甴很熟 posted on 2019-11-30 23:08:52

Using tidyr, this is a one-liner. Use the complete function, which creates a row for every combination of the columns passed to it and fills the remaining columns with NA wherever a combination was not present in the original data:

library(tidyr)
complete(m, id, age)

Source: local data frame [18 x 3]

      id   age    IQ
   (dbl) (dbl) (dbl)
1      1     2     3
2      1     3     4
3      1     4     5
4      1     5     4
5      1     6    NA
6      1     8    NA
7      2     2    NA
8      2     3     6
9      2     4    NA
10     2     5    NA
11     2     6     5
12     2     8    NA
13     3     2     3
14     3     3    NA
15     3     4    NA
16     3     5     8
17     3     6    NA
18     3     8    10
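
For a reproducible run, here is a minimal sketch. The question's data frame m is not shown in this answer, so the input below is an assumption, reconstructed from the non-NA rows of the output above.

```r
library(tidyr)

# Assumed input: rebuilt from the non-NA rows of the printed result,
# not copied from the original question.
m <- data.frame(
  id  = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  age = c(2, 3, 4, 5, 3, 6, 2, 5, 8),
  IQ  = c(3, 4, 5, 4, 6, 5, 3, 8, 10)
)

# One row per (id, age) combination; IQ is NA where no measurement existed.
res <- complete(m, id, age)
print(res)
```

With 3 distinct ids and 6 distinct ages, the result should have 3 * 6 = 18 rows, matching the output shown.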
akrun

We could do this using data.table. We convert the data.frame to a data.table (setDT(m)), set the key columns (setkey), and join with the cross join (CJ) of the unique elements of 'id' and 'age':

library(data.table)
setkey(setDT(m), id, age)[CJ(unique(id), unique(age))]
#    id age IQ
# 1:  1   2  3
# 2:  1   3  4
# 3:  1   4  5
# 4:  1   5  4
# 5:  1   6 NA
# 6:  1   8 NA
# 7:  2   2 NA
# 8:  2   3  6
# 9:  2   4 NA
#10:  2   5 NA
#11:  2   6  5
#12:  2   8 NA
#13:  3   2  3
#14:  3   3 NA
#15:  3   4 NA
#16:  3   5  8
#17:  3   6 NA
#18:  3   8 10

In the devel version, i.e. v1.9.5, we can use unique=TRUE within CJ (from @Frank's comment):

setDT(m, key=c('id', 'age'))[CJ(id, age, unique=TRUE)]
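
The same join can be spelled out step by step, which may make it easier to see what each piece does. This is only a sketch: the input is an assumption reconstructed from the non-NA rows of the output above, and DT and grid are illustrative names. It uses as.data.table rather than setDT so that m itself is left unmodified.

```r
library(data.table)

# Assumed input: rebuilt from the non-NA rows of the printed result.
m <- data.frame(
  id  = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  age = c(2, 3, 4, 5, 3, 6, 2, 5, 8),
  IQ  = c(3, 4, 5, 4, 6, 5, 3, 8, 10)
)

DT <- as.data.table(m)   # copy, so m is not changed by reference
setkey(DT, id, age)      # key columns used for the join

# Every (id, age) combination, sorted by id and then age
grid <- CJ(id = unique(DT$id), age = unique(DT$age))

# Keyed join: one row per row of grid, with IQ = NA where DT has no match
res <- DT[grid]
print(res)
```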

Benchmarks

set.seed(24)
m1 <- data.frame(id=rep(1:10000, each=10), age=sample(2:400, 10000*10, 
         replace=TRUE), IQ=rnorm(10000*10))
system.time(res1 <- complete(m1, id, age))
# user  system elapsed 
#18.888   0.000  16.258 


system.time({
  DT <- as.data.table(m1)
  res2 <- setkey(DT, id, age)[CJ(unique(id), unique(age))]
})
#  user  system elapsed 
#  0.000   0.000   0.279 



library(microbenchmark)
jeremy <- function() complete(m1, id, age)
akrun <- function() {
  DT <- as.data.table(m1)
  setkey(DT, id, age)[CJ(unique(id), unique(age))]
}

microbenchmark(jeremy(), akrun(), times=20L, unit='relative')
#Unit: relative
#   expr      min       lq   mean   median       uq      max neval cld
#jeremy() 24.95042 30.84234 17.138 23.09175 12.16891 8.305394    20   b
# akrun()  1.00000  1.00000  1.000  1.00000  1.00000 1.000000    20  a 