问题
I don't know if using dcast()
is the right way, but I want to reshape the following data.frame:
df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
x y z
1 p1 a 14
2 p1 b 14
3 p2 a 16
so that it looks like this one:
df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
x a b z
1 p1 1 1 14
2 p2 1 0 16
The variable y
in df
should be broken so that its elements are new variables, each dummy coded. All other variables (in this case just z
) are equal for each person (p1,p2 etc.). The only variable where a specific person p has different values is y
.
The reason I want this is because I need to merge this dataset with other ones by variable x
. Thing is, it needs to be one row per person (p1,p2 etc).
回答1:
This is almost a duplicate of a previous question, and the same basic answer I used there works again. No need for any external packages either.
aggregate(model.matrix(~ y - 1, data=df),df[c("x","z")],max)
x z ya yb
1 p1 14 1 1
2 p2 16 1 0
To explain this, as it is a bit odd looking, the model.matrix
call at its most basic returns a binary indicator variable for each unique value for each row of your data.frame, like so:
ya yb
1 1 0
2 0 1
3 1 0
If you aggregate
that intermediate result by your two id variables (x
and z
), you are then essentially acting on the initial data.frame of:
x z ya yb
1 p1 14 1 0
2 p1 14 0 1
3 p2 16 1 0
So if you take the max
value of ya
and yb
within each combination of x
and z
, you basically do:
x z ya yb
1 p1 14 1*max* 0
2 p1 14 0 1*max*
--collapse--
x z ya yb
1 p1 14 1 1
...and repeat that for each unique x
/z
combination to give the final result:
x z ya yb
1 p1 14 1 1
2 p2 16 1 0
Things get a bit crazy to generalise this to more columns, but it can be done, courtesy of this question e.g.:
df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c("14","15","16"))
intm <- model.matrix(~ y + z - 1, data=df,
contrasts.arg = sapply(df[2:3], contrasts, contrasts=FALSE))
aggregate(intm,df[c("x")],max)
x ya yb z14 z15 z16
1 p1 1 1 1 1 0
2 p2 1 0 0 0 1
回答2:
The following works, but seems cumbersome.
df2 <- df
df2$y <- as.numeric(y)
df$y2 <- as.numeric(df$y)
df2 <- dcast(df, x+z~y, value.var="y2")
df2
x z a b
1 p1 14 1 2
2 p2 16 1 NA
回答3:
I'm not sure much of this you have to do but if you need a way to automate it, I wrote this little function that might help:
First run dcast:
new = dcast(df, x+z~y, value.var="y")
Load into your R environment:
# args to be passed:
# df is your dataframe
# cols is a list of format c("colname1", "colname2", ... , "colnameN")
binarizeCols = function(df, cols){
for(i in cols){
column = which(colnames(df) == i)
truthRow = is.na(df[,column])
for(j in 1:length(truthRow)){
if(truthRow[j] == FALSE){
df[j,column] = 1
}else{
df[j,column] = 0
}
}
}
return(df)
}
then run:
new = binarizeCols(new, c("a", "b"))
and you get:
x z a b
1 p1 14 1 1
2 p2 16 1 0
not as fast as using _apply() but there's no hardcoding, you can enter any colnames you want (maybe you want to skip one in the middle?) and you dont create a new instance of your df. note: I use "=" instead of "<-" because I thought it was being phased out but they can be replaced if need be.
回答4:
df <- data.frame(x=c("p1","p1","p2","p3"),
y=c("a","b","a","c"),
z=c(14,14,16,17)) # wanted larger test case.
new <- dcast(df, x+z~y, value.var="y")
new[3:5] <- sapply(lapply(new[3:5], '%in%', unique(df$y) ), as.numeric)
new
x z a b c
1 p1 14 1 1 0
2 p2 16 1 0 0
3 p3 17 0 0 1
First check for containment in a vector that summarizes the possible values to create columns of logical values. Then 'dummify' by taking as.numeric of those logical values.
来源:https://stackoverflow.com/questions/18113443/reshape-data-using-dcast