问题
Ordinarily, I'd use a pretty basic long-to-wide reshape for this, but it seems to be dropping my aggregation variables. The setup is I had a job on mechanical Turk that I performed in triplicate---I want MTurk1, Mturk2, MTurk3's answers to be their own variables in the data frame but uniquely id'd by a field I input with the job, so that I can compare them against each other with a function later.
Current Format:
> head(mturk)
AssignmentStatus Input.id Input.State Answer.Q1thing
1 Approved 134231 NY Myguess
2 Approved 134231 NY Myguess
3 Approved 134231 NY BadGuess
4 Submitted 134812 CA Another
5 Approved 134812 CA Another
6 Approved 134812 CA Another
I'd like this to become
Input.id Input.State Answer.Q1thing.1 Answer.Q1thing.2 Answer.Q1thing.3 AssignmentStatus.1 AssignmentStatus.2 AssignmentStatus.3
134231 NY Myguess Myguess BadGuess Approved Approved Approved
134812 CA Another Another Another Submitted Approved Approved
or ideally, if there's a variable that can redo column names in the operation....
Id State Answer1 Answer2 Answer3 Status1 Status2 Status3
134231 NY Myguess Myguess BadGuess Approved Approved Approved
134812 CA Another Another Another Submitted Approved Approved
dat <- reshape(mturk, timevar="Answer.Q1thing", idvar=c("Input.id", "Input.state"), direction="wide")
This seems to be failing because most reshape long-to-wide functions expect that the variable that becomes wide itself a categorical text-field---that is, this is not a long-to-wide reshape operation because I don't want a variable named "MyGuess" "BadGuess" and "Another", but I want a generic "Answer.X" variable containing these values. I'm not trying to aggregate in any way, such as mean or sum, just list the value in a new place.
So, two directions for this question:
- Does this kind of operation have another name? Is this an unfold, unpivot, uncast or something?
- How to do this in R?
回答1:
If your data is in a data.table
it's a one-liner can be done as follows:
library(data.table)
mturk.dt <- as.data.table(mturk)
mturk.dt[, as.list(
rbind(c(Answer.Q1thing, AssignmentStatus))
)
, by=list(Id=Input.id, State=Input.State)]
Note that the by
argument handles the name-changing too!
If you want to properly name the other columns, use setnames
after the fact or, more dynamically, using setattr
within the j=..
argument as follows:
After the Fact:
## Assuming 'res' is the reshaped data.table form above:
## Change the names of the six V1, V2.. columns
setnames(res, paste0("V", 1:6), c(paste0("Answer", 1:3), paste0("Status", 1:3)))
Dynamically, in j=..
## Use `as.data.table` instead of `as.list`, to preserve new names
mturk.dt[, as.data.table(
rbind(c(
setattr(Answer.Q1thing, "names", paste0("Answer", seq(Answer.Q1thing )))
, setattr(AssignmentStatus, "names", paste0("Status", seq(AssignmentStatus)))
))
)
, by=list(Id=Input.id, State=Input.State)]
Id State Answer1 Answer2 Answer3 Status1 Status2 Status3
1: 134231 NY Myguess Myguess BadGuess Approved Approved Approved
2: 134812 CA Myguess Myguess BadGuess Approved Approved Approved
回答2:
Using plyr
:
res = ddply(dat,.(Input.id,Input.State),
function(x)unlist(as.character(x$Answer.Q1thing)))
setNames(res,c('Id','State','Answer1','Answer2','Answer3'))
Id State Answer1 Answer2 Answer3
1 134231 NY Myguess Myguess BadGuess
2 134812 CA Another Another Another
EDIT
In case you have fewer than 3 answers:
res = ddply(dat,.(Input.id,Input.State),
function(x)
{
xx= unlist(as.character(x$Answer.Q1thing))
if(length(xx)==3)xx
else c(xx,rep(NA,3-length(xx)))
})
回答3:
That was super helpful, @Ricardo and @agstudy. I realized my reshape wasn't working only because it demanded a unique, categorical "timevar". I think in most cases, you do have a categorical label/factor like this that makes it easy, but it is not much harder to count them, and make the count into a label.
I had a second problem was that my number of answers were not consistent; you both gave good help for that, but I was also able to just generate a counter and then implemented my original long-to-wide.
Where the count was < 3, that is where an Input.id only had 2 Answers, I got NAs for this, which is what I wanted.
So altogether:
mturk$idx <- with(mturk, ave(Input.id, Input.id, FUN=seq_along)) # weird!
dat <- reshape(mturk, timevar="idx", idvar=c("Input.id", "Input.state"), direction="wide")
I used the syntax for counting sequences within a group that I found here. This was a little idiosyncratic in the use of the ave()
function, but seems to crop up in a couple other answers. Tried rtl
, too, but had no luck. Using ave(x,x,seq_along) seems to mostly be a hack to avoid sorting. It's odd to use this work-around for sequences in groups because clearly both count() and rtl() are effectively creating this sequencing under the hood in a temp variable.
I like the way data.table allows this sequencing better.
回答4:
From data.table v1.9.5+, dcast
can handle multiple value.var
columns, i.e., we can cast multiple columns simultaneously. We can simply do:
dt[, id := seq_len(.N), by=Input.id]
dcast(dt, Input.id + Input.State ~ id,
value.var=c("AssignmentStatus", "Answer.Q1thing"))
# Input.id Input.State 1_AssignmentStatus 2_AssignmentStatus 3_AssignmentStatus
# 1: 134231 NY Approved Approved Approved
# 2: 134812 CA Submitted Approved Approved
# 1_Answer.Q1thing 2_Answer.Q1thing 3_Answer.Q1thing
# 1: Myguess Myguess BadGuess
# 2: Another Another Another
Or everything together in one line:
dcast(dt, Input.id + Input.State ~ dt[, seq_len(.N), by=Input.id]$V1,
value.var=c("AssignmentStatus", "Answer.Q1thing"))
来源:https://stackoverflow.com/questions/21611040/reshape-in-r-without-aggregation-for-example-mturk-response-strings