Reshape data using dcast?

Question

I don't know if using dcast() is the right way, but I want to reshape the following data.frame:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
   x y  z
1 p1 a 14
2 p1 b 14
3 p2 a 16

so that it looks like this one:

df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
   x a b  z
1 p1 1 1 14
2 p2 1 0 16

The variable y in df should be split so that each of its values becomes a new, dummy-coded variable. All other variables (in this case just z) are constant within each person (p1, p2, etc.); y is the only variable that takes different values for the same person.
I want this because I need to merge this dataset with others by the variable x, and that requires exactly one row per person (p1, p2, etc.).

Answer 1:

This is almost a duplicate of a previous question, and the same basic answer I used there works again. No need for any external packages either.

aggregate(model.matrix(~ y - 1, data=df),df[c("x","z")],max)

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

To explain this, as it is a bit odd looking, the model.matrix call at its most basic returns a binary indicator variable for each unique value for each row of your data.frame, like so:
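
A minimal sketch of that intermediate step, run on the question's df (the name mm is only an illustrative choice):

```r
df <- data.frame(x=c("p1","p1","p2"), y=c("a","b","a"), z=c(14,14,16))

# One indicator column per unique value of y; "- 1" drops the intercept
# so that every level gets its own column.
mm <- model.matrix(~ y - 1, data=df)

# Subsetting the columns drops the extra attributes from the printout.
mm[, c("ya","yb")]
#   ya yb
# 1  1  0
# 2  0  1
# 3  1  0
```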

If you aggregate that intermediate result by your two id variables (x and z), you are then essentially acting on the initial data.frame of:

   x  z ya yb
1 p1 14  1  0
2 p1 14  0  1
3 p2 16  1  0

So if you take the max value of ya and yb within each combination of x and z, you basically do:

   x  z ya      yb
1 p1 14  1*max*  0
2 p1 14  0       1*max*

--collapse--

   x  z ya      yb
1 p1 14  1       1

...and repeat that for each unique x/z combination to give the final result:

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0
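
Since the question's goal is a merge by x, here is a rough sketch of the follow-up steps. The data frame other and its age column are made up for illustration, and stripping the "y" prefix assumes the indicator columns are the only ones after the two id columns:

```r
df <- data.frame(x=c("p1","p1","p2"), y=c("a","b","a"), z=c(14,14,16))
wide <- aggregate(model.matrix(~ y - 1, data=df), df[c("x","z")], max)

# model.matrix prefixes each level with the variable name ("ya", "yb");
# strip that prefix from the indicator columns only (columns 3 onwards).
names(wide)[-(1:2)] <- sub("^y", "", names(wide)[-(1:2)])

# Hypothetical second dataset with one row per person.
other <- data.frame(x=c("p1","p2"), age=c(30, 41))

merged <- merge(wide, other, by="x")
merged
```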

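Generalising this to more columns gets a bit messy, but it can be done (courtesy of this question), e.g.:

# stringsAsFactors=TRUE is needed on R >= 4.0, because contrasts() only accepts factors
df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c("14","15","16"),
                 stringsAsFactors=TRUE)
# full (non-reduced) dummy coding for every factor; lapply always returns a list
intm <- model.matrix(~ y + z - 1, data=df,
                 contrasts.arg = lapply(df[2:3], contrasts, contrasts=FALSE))
aggregate(intm,df[c("x")],max)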

   x ya yb z14 z15 z16
1 p1  1  1   1   1   0
2 p2  1  0   0   0   1

Answer 2:

The following works, but seems cumbersome.

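library(reshape2)
# number the levels of y (a = 1, b = 2); factor() is needed because
# as.numeric() on a character column would only give NA
df$y2 <- as.numeric(factor(df$y))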

df2 <- dcast(df, x+z~y, value.var="y2")

df2
   x  z a  b
1 p1 14 1  2
2 p2 16 1 NA
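
If dcast is preferred, one possible way to get 0/1 columns directly is to let it count rows per cell with fun.aggregate=length. This is a sketch rather than part of the answer above, and it assumes each x/y pair occurs at most once (otherwise the cells hold counts larger than 1):

```r
library(reshape2)

df <- data.frame(x=c("p1","p1","p2"), y=c("a","b","a"), z=c(14,14,16))

# Count the rows in each x/z/y cell; empty cells are filled with 0.
counts <- dcast(df, x + z ~ y, value.var="y", fun.aggregate=length)
counts
#    x  z a b
# 1 p1 14 1 1
# 2 p2 16 1 0
```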

Answer 3:

I'm not sure how much of this you have to do, but if you need a way to automate it, I wrote this little function that might help:

First run dcast:

library(reshape2)
new = dcast(df, x+z~y, value.var="y")

Load into your R environment:

# Arguments:
#   df   - your data.frame
#   cols - character vector of column names, e.g. c("colname1", "colname2", ... , "colnameN")
binarizeCols = function(df, cols){
  for(i in cols){
    # TRUE where dcast left the cell empty (NA)
    truthRow = is.na(df[[i]])
    # build a numeric 0/1 vector so the column does not stay character
    binarized = numeric(length(truthRow))
    for(j in seq_along(truthRow)){
      if(truthRow[j] == FALSE){
        binarized[j] = 1
      }else{
        binarized[j] = 0
      }
    }
    df[[i]] = binarized
  }
  return(df)
}

then run:

new = binarizeCols(new, c("a", "b"))

and you get:

   x  z a b
1 p1 14 1 1
2 p2 16 1 0

It is not as fast as using the apply() family, but there's no hardcoding: you can enter any column names you want (maybe you want to skip one in the middle?), and you don't create a new instance of your df. Note: I use "=" instead of "<-" because I thought "<-" was being phased out, but they can be swapped if need be.
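
For comparison, a vectorised sketch of the same idea using lapply. The data frame new below is a hand-built stand-in for the dcast result, so the snippet runs without reshape2:

```r
# Stand-in for dcast(df, x+z~y, value.var="y"): each cell holds the
# y value itself, or NA where that person has no such value.
new <- data.frame(x=c("p1","p2"), z=c(14,16), a=c("a","a"), b=c("b",NA))

cols <- c("a", "b")
# 1 where a value is present, 0 where it is NA, one whole column at a time
new[cols] <- lapply(new[cols], function(col) as.integer(!is.na(col)))
new
#    x  z a b
# 1 p1 14 1 1
# 2 p2 16 1 0
```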

Answer 4:

library(reshape2)
df <- data.frame(x=c("p1","p1","p2","p3"),
                 y=c("a","b","a","c"),
                 z=c(14,14,16,17))  # wanted larger test case.
new <- dcast(df, x+z~y, value.var="y")
new[3:5] <- sapply(lapply(new[3:5], '%in%', unique(df$y) ), as.numeric)
new
   x  z a b c
1 p1 14 1 1 0
2 p2 16 1 0 0
3 p3 17 0 0 1

First check for containment in a vector that summarizes the possible values to create columns of logical values. Then 'dummify' by taking as.numeric of those logical values.
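
The two steps can be seen in isolation on a single column. A small sketch, where col_b is a hand-written stand-in for what column b holds after dcast on the question's original df:

```r
col_b <- c("b", NA)                     # p1 has "b", p2 has nothing
is_present <- col_b %in% c("a", "b")    # NA is not in the set, so FALSE
is_present
# [1]  TRUE FALSE
as.numeric(is_present)
# [1] 1 0
```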

Source: https://stackoverflow.com/questions/18113443/reshape-data-using-dcast

Tags

reshape

reshape2