It is a truth universally acknowledged that R's base reshape command is speedy and powerful but has miserable syntax. I have therefore written a quick wrapper around it wh
I think there might be a mistake in your example. For going from wide to long, I get the following error:
> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
Error in gsub(paste("[", paste(omit, collapse = "", sep = ""), "]$", sep = ""), :
invalid regular expression '[]$', reason 'Missing ']''
Removing the NULL corrects the problem. Which leads me to ask: what is the intended purpose of that NULL?
I also think that the function would be improved if it generated a time variable by default when the user does not specify one (as is done in reshape()).
See, for instance, the following from base reshape():
> head(reshape(x.wide, direction="long", idvar=1, varying=2:13, sep="_"))
    surveyNum time pio caremgmt prev price
1.1         1    1   2        2    1     2
2.1         2    1   2        1    2     1
3.1         3    1   1        1    2     2
4.1         4    1   2        2    1     5
5.1         5    1   1        1    1     3
6.1         6    1   1        2    2     4
If I'm familiar with this, and I see that your function takes care of "varying" for me, I might be tempted to try:
> head(reshapeasy( x.wide, "long", id="surveyNum", sep="_" ))
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘1.1’
But that's not a very useful error. Perhaps including a custom error message might be useful for your final function.
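As a rough illustration of the kind of check I have in mind, here is a minimal sketch. The helper name check_reshape_args and the wording of the messages are my own invention, not part of reshapeasy, and I am assuming that vary is required when reshaping to long:

```r
# Hypothetical argument check to run at the top of the wrapper.
# The name and the messages are illustrative assumptions only.
check_reshape_args <- function(data, id, vary = NULL) {
  if (!is.data.frame(data)) {
    stop("'data' must be a data frame", call. = FALSE)
  }
  # Only character ids can be checked against the column names
  if (is.character(id)) {
    missing_id <- setdiff(id, names(data))
    if (length(missing_id) > 0) {
      stop("id column(s) not found in data: ",
           paste(missing_id, collapse = ", "), call. = FALSE)
    }
  }
  if (is.null(vary)) {
    stop("'vary' must be supplied when reshaping to long", call. = FALSE)
  }
  invisible(TRUE)
}

# Toy data, made up for this example
x <- data.frame(surveyNum = 1:2, pio_1 = c(1, 2))
try(check_reshape_args(x, id = "surveyNum"))  # stops with the 'vary' message
```

That way the user sees which argument is at fault instead of a complaint about duplicate row names.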
Allowing the user to set vary to NULL, as you have done in your present version of the function, also doesn't seem wise to me. This yields output like this:
> head(reshapeasy( x.wide, "long", id="surveyNum", NULL, sep="_" ))
    surveyNum pio caremgmt prev price
1.1         1   2        2    1     2
2.1         2   2        1    2     1
3.1         3   1        1    2     2
4.1         4   2        2    1     5
5.1         5   1        1    1     3
6.1         6   1        2    2     4
The problem with this output is that if I needed to reshape back to wide, I couldn't do it easily. Thus, I think that retaining reshape's default of generating a time variable, while letting the user override it, might be a useful feature.
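To show why the time column matters, here is a small round trip using base reshape() only. The toy data frame is made up for this example (it is not your x.wide), and I have kept just two measured variables:

```r
# Toy wide data, invented for illustration
wide <- data.frame(
  surveyNum = 1:3,
  pio_1 = c(2, 2, 1), pio_2 = c(1, 2, 2),
  price_1 = c(2, 1, 2), price_2 = c(5, 3, 4)
)

# Wide to long: reshape() splits the names on "_" and adds a "time" column
long <- reshape(wide, direction = "long", idvar = "surveyNum",
                varying = 2:5, sep = "_")

# Long to wide: the "time" column tells reshape() which wave each row is from
back <- reshape(long, direction = "wide", idvar = "surveyNum",
                timevar = "time", sep = "_")
```

Without the time column in the long data, there would be nothing to pass as timevar in the second call.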
Perhaps for those who are lazy and don't like to type the variable names, you can add the following to the head of your function:
# Translate numeric column positions into column names;
# character input is left unchanged
if (is.numeric(id)) {
  id = colnames(data)[id]
}
if (is.numeric(vary)) {
  vary = colnames(data)[vary]
}
Then, following with your examples, you can use the following shorthand:
reshapeasy(x.wide, direction="long", id=1, sep="_", vary="id")
reshapeasy(x.long, direction="wide", id=6, vary=1)
(I know, it might not be good practice since the code might be less readable or less easily understandable by someone later on, but it does happen frequently.)
Some initial thoughts:
I've always thought that the direction commands "wide" and "long" were a little fuzzy. Do they mean you want to convert the data to that format, or that the data is already in that format? It is something that you need to learn or look up. You can avoid that problem by having two separate functions, reshapeToWide and reshapeToLong. As a bonus, the signature of each function has one less argument.
I don't think you meant to include the line
varying <- which(!(colnames(x.wide) %in% "surveyNum"))
since it refers to a specific dataset.
I prefer data to x for the first argument since it makes it clear that the input should be a data frame.
It is generally better form to have arguments without defaults first. So vars should come after id and vary.
Can you pick defaults for id and vary? reshape::melt defaults to factor and character columns for id and numeric columns for vary.
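To make a couple of these points concrete (one function per direction, and type-based defaults for id and vary), here is a minimal sketch built directly on base reshape(). The helper guessIdVary and the function bodies are my own assumptions about how this could look, not your implementation:

```r
# Hypothetical helper: factor/character columns become ids,
# numeric columns become the varying columns
guessIdVary <- function(data) {
  isId <- vapply(data, function(col) is.factor(col) || is.character(col),
                 logical(1))
  isNum <- vapply(data, is.numeric, logical(1))
  list(id = names(data)[isId], vary = names(data)[isNum & !isId])
}

# Sketch of a long-direction wrapper with guessed defaults
reshapeToLong <- function(data, id = guessIdVary(data)$id,
                          vary = guessIdVary(data)$vary, sep = "_") {
  reshape(data, direction = "long", idvar = id, varying = vary, sep = sep)
}

# Sketch of a wide-direction wrapper; "time" is reshape()'s default timevar
reshapeToWide <- function(data, id, timevar = "time", sep = "_") {
  reshape(data, direction = "wide", idvar = id, timevar = timevar, sep = sep)
}

# Toy data, invented for illustration
toy <- data.frame(surveyNum = c("a", "b", "c"),
                  pio_1 = c(2, 2, 1), pio_2 = c(1, 2, 2),
                  stringsAsFactors = FALSE)
toyLong <- reshapeToLong(toy)  # id and vary are guessed from column types
```

The guessed defaults only make sense when the id really is stored as a factor or character column; a numeric id such as surveyNum in your data would still have to be named explicitly.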
I would also like to see an option to order the output, since that's one of the things I don't like about reshape in base R. As an example, let's use the Stata Learning Module: Reshaping data wide to long, which you are already familiar with. The example I'm looking at is the "kids height and weight at age 1 and age 2" example.
Here's what I normally do with reshape():
library(foreign)  # provides read.dta()
kidshtwt = read.dta("http://www.ats.ucla.edu/stat/stata/modules/kidshtwt.dta")
kidshtwt.l = reshape(kidshtwt, direction="long", idvar=1:2,
                     varying=3:6, sep="", timevar="age")
# The reshaped data is correct, just not in the order I want it,
# so I always have to do another step like this
kidshtwt.l = kidshtwt.l[order(kidshtwt.l$famid, kidshtwt.l$birth),]
Since this is an annoying step that I always have to go through when reshaping the data, I think it would be useful to add that into your function.
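One possible way to fold that step in is a small helper that the wrapper calls when the user asks for sorted output. This is only a sketch; the name sort_rows_by and the toy data are hypothetical:

```r
# Hypothetical helper: order the rows of a data frame by the given columns.
# unname() keeps column names from being mistaken for arguments of order().
sort_rows_by <- function(data, cols) {
  data[do.call(order, unname(as.list(data[cols]))), , drop = FALSE]
}

# Toy long-format data, invented for illustration
kids <- data.frame(famid = c(2, 1, 1, 2),
                   birth = c(1, 2, 1, 2),
                   ht    = c(10, 20, 30, 40))
kids.sorted <- sort_rows_by(kids, c("famid", "birth"))
```

Inside the wrapper, this could sit behind an optional argument so that the default behaviour of reshape() stays unchanged.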
I also suggest at least having an option for doing the same thing with the final column order when reshaping from long to wide.
I'm not sure of the best way to integrate this into your function, but I put this together to sort a data frame based on basic patterns for the variable names.
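col.name.sort = function(data, patterns) {
  a = names(data)
  subs = vector("list", length(patterns))
  # For each pattern, collect the matching column names in sorted order
  for (i in seq_along(patterns)) {
    subs[[i]] = sort(grep(patterns[i], a, value=TRUE))
  }
  x = unlist(subs)
  # Columns that match no pattern are dropped
  data[ , x, drop=FALSE]
}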
It can be used in the following manner. Imagine we had saved the output of your reshapeasy long to wide example as a data frame named a, and we wanted it ordered by "surveyNum", "caremgmt" (1-3), "prev" (1-3), "pio" (1-3), and "price" (1-3). We could use:
col.name.sort(a, c("sur", "car", "pre", "pio", "pri"))
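Here is a self-contained toy run so the behaviour is easy to check. The data frame below is invented (it is not the real reshapeasy output), and the function is repeated in compact form only so the snippet runs on its own:

```r
# Compact copy of col.name.sort so this example is runnable by itself
col.name.sort <- function(data, patterns) {
  cols <- unlist(lapply(patterns, function(p) {
    sort(grep(p, names(data), value = TRUE))
  }))
  data[, cols, drop = FALSE]
}

# Toy wide data with deliberately scrambled column order
toy.wide <- data.frame(price_2 = 5, pio_1 = 2, surveyNum = 1,
                       price_1 = 2, pio_2 = 1)

sorted.names <- names(col.name.sort(toy.wide, c("sur", "pio", "pri")))
# sorted.names: "surveyNum" "pio_1" "pio_2" "price_1" "price_2"

# Columns that match no pattern are left out of the result
fewer.cols <- col.name.sort(toy.wide, c("sur", "pio"))
```

Two things to keep in mind: a column that matches none of the patterns is silently dropped, and a pattern that is too short (say "pr" in your data) would pull in both the "prev" and "price" columns at once.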