Reshaping from long to wide with some missing data (NA's) on time invariant variables

旧街凉风 提交于 2020-01-21 10:29:18

问题


When using stats:::reshape() from base to convert data from long to wide format, for any variables designated as time invariant, reshape just takes the first observed value, and if the variable is actually varying in some way, outputs a warning. In my case, I have missing data on variables I would like to designate as time invariant, but since I have this data at other time points, I would like the value from those time points to be used rather than the NA which is first observed.

testdata <- data.frame(matrix(c(1,1,2,3,4,3.5,NA,6,4,1,2,1), nrow = 3))
colnames(testdata) <- c("id", "process1", "timeinvariant", "time")
# > testdata
#   id process1 timeinvariant time
# 1  1      3.0            NA    1
# 2  1      4.0             6    2
# 3  2      3.5             4    1

# Note here the missing data on the invariant process at time 1
reshaped <- reshape(testdata, v.names="process1", direction = "wide")
# > reshaped
#   id timeinvariant process1.1 process1.2
# 1  1            NA        3.0          4
# 3  2             4        3.5         NA

The NA is propogated through to the wide format, when I would rather take the value observed at time 2 (or whenever).


回答1:


I don't know how to fix the problem but one way to fix the symptom would be to push the NA values down in the order.

testdata <- testdata[order(testdata$timeinvariant),]
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0            NA    1
reshaped<-reshape(testdata,v.names="process1",direction="wide")
reshaped
#  id timeinvariant process1.1 process1.2
#3  2             4        3.5         NA
#2  1             6        3.0          4

A more general solution will be to make sure there is only one value in the timevariant column per id

testdata$timeinvariant <- apply(testdata,1,function(x) max(testdata[testdata$id == x[1],"timeinvariant"],na.rm=T))
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0             6    1

This can be repeated for any number of columns before calling the reshape function. Hope this helps




回答2:


If there's always at least one non-missing value of timeinvariant for each id, and all (non-missing) values of timeinvariant are identical for each id (since it's time-invariant), couldn't you create a new column that fills in the NA values in timeinvariant and then reshape using that column? For example:

# Add another row to your data frame so that we'll have 2 NA values to deal with
td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
colnames(td) <- c("id", "process1", "timeinvariant", "time")

# Create new column timeinvariant2, which fills in NAs from timeinvariant,
# then reshape using that column
library(dplyr)
td.wide = td %>%
  group_by(id) %>%
  mutate(timeinvariant2=max(timeinvariant, na.rm=TRUE)) %>%
  dcast(id + timeinvariant2 ~ time, value.var='process1')

# Paste "process1." onto names of all "time" columns
names(td.wide) = gsub("(^[0-9]*$)", "process1\\.\\1", names(td.wide) )

td.wide

  id timeinvariant2 process1.1 process1.2 process1.3
1  1              6        3.0          4        4.5
2  2              4        3.5         NA         NA


来源:https://stackoverflow.com/questions/24043381/reshaping-from-long-to-wide-with-some-missing-data-nas-on-time-invariant-vari

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!