问题
When using stats:::reshape()
from base to convert data from long to wide format, for any variables designated as time invariant, reshape
just takes the first observed value, and if the variable is actually varying in some way, outputs a warning. In my case, I have missing data on variables I would like to designate as time invariant, but since I have this data at other time points, I would like the value from those time points to be used rather than the NA
which is first observed.
testdata <- data.frame(matrix(c(1,1,2,3,4,3.5,NA,6,4,1,2,1), nrow = 3))
colnames(testdata) <- c("id", "process1", "timeinvariant", "time")
# > testdata
# id process1 timeinvariant time
# 1 1 3.0 NA 1
# 2 1 4.0 6 2
# 3 2 3.5 4 1
# Note here the missing data on the invariant process at time 1
reshaped <- reshape(testdata, v.names="process1", direction = "wide")
# > reshaped
# id timeinvariant process1.1 process1.2
# 1 1 NA 3.0 4
# 3 2 4 3.5 NA
The NA
is propogated through to the wide format, when I would rather take the value observed at time 2 (or whenever).
回答1:
I don't know how to fix the problem but one way to fix the symptom would be to push the NA values down in the order.
testdata <- testdata[order(testdata$timeinvariant),]
testdata
# id process1 timeinvariant time
#3 2 3.5 4 1
#2 1 4.0 6 2
#1 1 3.0 NA 1
reshaped<-reshape(testdata,v.names="process1",direction="wide")
reshaped
# id timeinvariant process1.1 process1.2
#3 2 4 3.5 NA
#2 1 6 3.0 4
A more general solution will be to make sure there is only one value in the timevariant column per id
testdata$timeinvariant <- apply(testdata,1,function(x) max(testdata[testdata$id == x[1],"timeinvariant"],na.rm=T))
testdata
# id process1 timeinvariant time
#3 2 3.5 4 1
#2 1 4.0 6 2
#1 1 3.0 6 1
This can be repeated for any number of columns before calling the reshape function. Hope this helps
回答2:
If there's always at least one non-missing value of timeinvariant
for each id
, and all (non-missing) values of timeinvariant
are identical for each id
(since it's time-invariant), couldn't you create a new column that fills in the NA
values in timeinvariant
and then reshape using that column? For example:
# Add another row to your data frame so that we'll have 2 NA values to deal with
td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
colnames(td) <- c("id", "process1", "timeinvariant", "time")
# Create new column timeinvariant2, which fills in NAs from timeinvariant,
# then reshape using that column
library(dplyr)
td.wide = td %>%
group_by(id) %>%
mutate(timeinvariant2=max(timeinvariant, na.rm=TRUE)) %>%
dcast(id + timeinvariant2 ~ time, value.var='process1')
# Paste "process1." onto names of all "time" columns
names(td.wide) = gsub("(^[0-9]*$)", "process1\\.\\1", names(td.wide) )
td.wide
id timeinvariant2 process1.1 process1.2 process1.3
1 1 6 3.0 4 4.5
2 2 4 3.5 NA NA
来源:https://stackoverflow.com/questions/24043381/reshaping-from-long-to-wide-with-some-missing-data-nas-on-time-invariant-vari