Reshape messy longitudinal survey data containing multiple different variables, wide to long

Submitted by 天涯浪子 on 2019-12-20 05:31:36

Question


I hope that I'm not reinventing the wheel, but I do not think that the following can be done with reshape.

I have messy longitudinal survey data that I want to convert from wide to long format. By messy I mean:

  • I have a mixture of variable types (numeric, factor, logical)
  • Not all variables have been collected at every timepoint.

For example:

data <- read.table(header=TRUE, text='
  id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
  1      TRUE    FALSE 87717.76 82281.25  happy  happy filler
  2      TRUE     TRUE 70795.53 54995.19  so-so  happy filler
  3     FALSE    FALSE 48012.77 47650.47    sad  so-so filler
 ')

I could not work out how to reshape the data using reshape; I keep getting the error message 'times' is wrong length, which I assume is because not every variable is recorded on every occasion. I also don't think melt and cast from reshape2 will work, as they require all measured variables to be of the same type.
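
To see which variable/timepoint combinations are actually absent, the column names can be split into a stem and a time and then cross-tabulated. This is only a diagnostic sketch in base R: the names `varying`, `stem`, `time`, and `coverage` are illustrative, and it assumes every time-varying column is named `<stem>.<digits>`.

```r
# Example data from the question
data <- read.table(header = TRUE, text = '
  id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
  1      TRUE    FALSE 87717.76 82281.25  happy  happy filler
  2      TRUE     TRUE 70795.53 54995.19  so-so  happy filler
  3     FALSE    FALSE 48012.77 47650.47    sad  so-so filler
')

# Keep only the columns whose names end in ".<digits>" (assumed naming scheme)
varying <- grep("\\.[0-9]+$", names(data), value = TRUE)

# Split each name into its stem (before the last dot) and its time (after it)
stem <- sub("\\.[0-9]+$", "", varying)
time <- sub("^.*\\.", "", varying)

# Cross-tabulate: a 0 marks a variable that was not collected at that timepoint
coverage <- table(stem, time)
print(coverage)
```

Any zero in the table marks a gap in the wide layout, which is consistent with the guess above that the unbalanced columns are what reshape objects to.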

I came up with the following solution, which may help others. It selects variables by timepoint, renames them, and then uses rbind.fill from plyr to concatenate the pieces. But I wonder whether I'm missing something with reshape, or whether this can be done more easily using tidyr or another package?

reshapeLong2 <- function(data, varying = NULL, timevar = "time", idvar = "id", sep = ".", patterns = NULL) {

  # rbind.fill() comes from plyr; library() fails early if it is not installed
  library(plyr)

  # Return the last n characters of each string
  substrRight <- function(x, n) {
    substr(x, nchar(x) - n + 1, nchar(x))
  }

  if (is.null(varying))
    varying <- names(data)[!names(data) %in% idvar]

  # Create patterns if not specified: guess the timepoints from the single
  # digit at the end of each variable name (so only times 0-9 are detected).
  if (is.null(patterns)) {
    times <- unique(na.omit(as.numeric(substrRight(varying, 1))))
    times <- sort(times)
    patterns <- paste0(sep, times)
  }

  # Create list of datasets by study time
  ls.df <- lapply(patterns, function(pattern) {
    # Match the pattern literally at the end of the name, so that "." is not
    # treated as a regular-expression wildcard
    var.old <- varying[endsWith(varying, pattern)]
    var.new <- substr(var.old, 1, nchar(var.old) - nchar(pattern))
    df <- data[, c(idvar, var.old), drop = FALSE]
    names(df) <- c(idvar, var.new)
    df[, timevar] <- match(pattern, patterns)
    return(df)
  })

  # Concatenate datasets together, filling absent columns with NA
  dfs <- rbind.fill(ls.df)
  return(dfs)
}

> reshapeLong2(df.test)
  id inlove  mood time   income
1  1  FALSE   sad    1       NA
2  2   TRUE so-so    1       NA
3  3   TRUE   sad    1       NA
4  1   TRUE  <NA>    2 27766.13
5  2  FALSE  <NA>    2 74395.30
6  3   TRUE  <NA>    2 89004.95
7  1     NA   sad    3 27270.07
8  2     NA so-so    3 36971.64
9  3     NA so-so    3 85986.96
Warning message:
In na.omit(as.numeric(substrRight(varying, 1))) :
  NAs introduced by coercion

Note: the warning message indicates that some variables were dropped (in this case "random", whose name does not end in a digit). The warning is not shown if all variables are listed as either idvar or varying.


Answer 1:


If you add the missing varname.TIME columns, filled with NA, for every timepoint at which a variable was not collected, you can then call reshape directly:

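# Stems of the time-varying variables
uniqnames <- c("inlove","income","mood")
# make.unique() appends .1, .2, .3 to the repeated stems; drop the first three (unsuffixed) names
allnames  <- make.unique(rep(uniqnames,4))[-(seq_along(uniqnames))]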
#[1] "inlove.1" "income.1" "mood.1"   "inlove.2" "income.2" "mood.2" ...
data[setdiff(allnames, names(data)[-1])] <- NA
#  id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random income.1 mood.2 inlove.3
#1  1     TRUE    FALSE 87717.76 82281.25  happy  happy filler       NA     NA       NA
#2  2     TRUE     TRUE 70795.53 54995.19  so-so  happy filler       NA     NA       NA
#3  3    FALSE    FALSE 48012.77 47650.47    sad  so-so filler       NA     NA       NA

reshape(data, idvar="id", direction="long", sep=".", varying=allnames)

#    id random time inlove   income  mood
#1.1  1 filler    1   TRUE       NA happy
#2.1  2 filler    1   TRUE       NA so-so
#3.1  3 filler    1  FALSE       NA   sad
#1.2  1 filler    2  FALSE 87717.76  <NA>
#2.2  2 filler    2   TRUE 70795.53  <NA>
#3.2  3 filler    2  FALSE 48012.77  <NA>
#1.3  1 filler    3     NA 82281.25 happy
#2.3  2 filler    3     NA 54995.19 happy
#3.3  3 filler    3     NA 47650.47 so-so
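
The same padding idea can be written without hard-coding the stems or the number of timepoints. The following is a sketch rather than part of the original answer: it assumes every time-varying column is named `<stem>.<digits>`, and the names `varying`, `stems`, `times`, and `long` are illustrative.

```r
# Example data from the question
data <- read.table(header = TRUE, text = '
  id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
  1      TRUE    FALSE 87717.76 82281.25  happy  happy filler
  2      TRUE     TRUE 70795.53 54995.19  so-so  happy filler
  3     FALSE    FALSE 48012.77 47650.47    sad  so-so filler
')

# Derive stems and timepoints from the column names (assumed <stem>.<digits>)
varying <- grep("\\.[0-9]+$", names(data), value = TRUE)
stems   <- unique(sub("\\.[0-9]+$", "", varying))
times   <- sort(unique(as.integer(sub("^.*\\.", "", varying))))

# Full stem x time grid, ordered time by time: all stems at time 1, then time 2, ...
allnames <- as.vector(outer(stems, times, paste, sep = "."))

# Add the columns that were never collected, filled with NA
data[setdiff(allnames, names(data))] <- NA

# With a balanced set of columns, reshape() can split the names on "."
long <- reshape(data, idvar = "id", direction = "long", sep = ".",
                varying = allnames)
print(long)
```

Columns that do not follow the naming scheme, such as `random`, are left out of `varying` and are therefore repeated on every row as constants.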


Source: https://stackoverflow.com/questions/34713846/reshape-messy-longitudinal-survey-data-containing-multiple-different-variables
