Reshape in R without aggregation (for example MTurk response strings)

旧巷少年郎 2021-01-21 09:16

Ordinarily, I'd use a pretty basic long-to-wide reshape for this, but it seems to be dropping my aggregation variables. The setup is I had a job on Mechanical Turk that I perfo

4 Answers
  • 2021-01-21 09:57

    Using plyr:

    library(plyr)

    # One row per (Input.id, Input.State); each answer becomes its own column
    res = ddply(dat,.(Input.id,Input.State),
                function(x)unlist(as.character(x$Answer.Q1thing)))
    setNames(res,c('Id','State','Answer1','Answer2','Answer3'))
      Id State Answer1 Answer2  Answer3
    1 134231    NY Myguess Myguess BadGuess
    2 134812    CA Another Another  Another
    

    EDIT

    In case you have fewer than 3 answers:

    res = ddply(dat,.(Input.id,Input.State),
                function(x) {
                  xx = as.character(x$Answer.Q1thing)
                  # pad with NA so every group returns exactly 3 values
                  if (length(xx) == 3) xx
                  else c(xx, rep(NA, 3 - length(xx)))
                })
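The padding step can also be sketched in base R, without plyr. This is only an illustration: the toy data and the helper name `pad_to` are made up here, and the target width is taken from the largest group rather than hard-coded as 3.

```r
# Toy data made up for illustration; column names follow the thread.
dat <- data.frame(
  Input.id       = c(134231, 134231, 134231, 134812, 134812),
  Answer.Q1thing = c("Myguess", "Myguess", "BadGuess", "Another", "Another"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: right-pad a vector with NA up to length n
pad_to <- function(x, n) c(x, rep(NA, n - length(x)))

# Widest group decides how many answer columns are needed
n_max <- max(table(dat$Input.id))

# Split answers by id, pad each group, and stack the groups as matrix rows
padded <- lapply(split(dat$Answer.Q1thing, dat$Input.id), pad_to, n = n_max)
res <- do.call(rbind, padded)
print(res)
```

The result is a character matrix with one row per `Input.id`; groups with fewer answers get `NA` in the trailing columns.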
    
  • 2021-01-21 10:02

    If your data is in a data.table, this can be done in one line:

    library(data.table)    
    mturk.dt <- as.data.table(mturk)
    
    mturk.dt[, as.list(
             rbind(c(Answer.Q1thing, AssignmentStatus))
             )
            , by=list(Id=Input.id, State=Input.State)]
    

    Note that the by argument handles the name-changing too!


    If you want to properly name the other columns, use setnames after the fact or, more dynamically, use setattr within the j=.. argument as follows:

    After the Fact:

    ## Assuming 'res' is the reshaped data.table from above:
    ## Change the names of the six V1, V2.. columns 
    setnames(res, paste0("V", 1:6), c(paste0("Answer", 1:3), paste0("Status", 1:3)))
    

    Dynamically, in j=..

    ## Use `as.data.table` instead of `as.list`, to preserve new names
    mturk.dt[, as.data.table(
             rbind(c(
                  setattr(Answer.Q1thing,   "names", paste0("Answer", seq(Answer.Q1thing  )))
                , setattr(AssignmentStatus, "names", paste0("Status", seq(AssignmentStatus)))
                ))
             )
            , by=list(Id=Input.id, State=Input.State)]
    
           Id State Answer1 Answer2  Answer3  Status1  Status2  Status3
    1: 134231    NY Myguess Myguess BadGuess Approved Approved Approved
    2: 134812    CA Myguess Myguess BadGuess Approved Approved Approved
    
  • 2021-01-21 10:08

    That was super helpful, @Ricardo and @agstudy. I realized my reshape wasn't working only because it demanded a unique, categorical "timevar". I think in most cases, you do have a categorical label/factor like this that makes it easy, but it is not much harder to count them, and make the count into a label.

    A second problem was that the number of answers was not consistent across IDs; you both gave good help for that, but I was also able to just generate a counter and then apply my original long-to-wide reshape.

    Where the count was < 3 (that is, where an Input.id had only 2 answers), I got NAs, which is what I wanted.

    So altogether:

    mturk$idx <- with(mturk, ave(Input.id, Input.id, FUN=seq_along)) # weird!
    dat <- reshape(mturk, timevar="idx", idvar=c("Input.id", "Input.State"), direction="wide")
    

    I used the syntax for counting sequences within a group that I found here. It is a little idiosyncratic in its use of the ave() function, but it seems to crop up in a couple of other answers. I tried rtl, too, but had no luck. Using ave(x, x, FUN=seq_along) seems to be mostly a hack to avoid sorting. It's odd to need this workaround for sequences within groups, because clearly both count() and rtl() are effectively creating this sequencing under the hood in a temp variable.

    I like the way data.table allows this sequencing better.
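A minimal, self-contained sketch of the counter-plus-reshape approach in base R. The toy data is reconstructed from the outputs shown in this thread, with one CA row deliberately dropped (an assumption for illustration) so that the NA padding is visible.

```r
# Toy data for illustration; one CA answer is omitted on purpose.
mturk <- data.frame(
  Input.id       = c(134231, 134231, 134231, 134812, 134812),
  Input.State    = c("NY", "NY", "NY", "CA", "CA"),
  Answer.Q1thing = c("Myguess", "Myguess", "BadGuess", "Another", "Another"),
  stringsAsFactors = FALSE
)

# Number the rows within each Input.id; FUN must be passed by name,
# otherwise ave() treats it as another grouping variable
mturk$idx <- with(mturk, ave(Input.id, Input.id, FUN = seq_along))

# The counter serves as the "timevar" that reshape() requires
wide <- reshape(mturk, timevar = "idx",
                idvar = c("Input.id", "Input.State"),
                direction = "wide")
print(wide)
```

Here `wide` has one row per `Input.id`, with columns `Answer.Q1thing.1` through `Answer.Q1thing.3`; the third answer for the CA row is `NA` because that group has only two responses.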

  • 2021-01-21 10:15

    As of data.table v1.9.5, dcast can handle multiple value.var columns, i.e., we can cast multiple columns simultaneously. We can simply do:

    dt[, id := seq_len(.N), by=Input.id]
    dcast(dt, Input.id + Input.State ~ id, 
            value.var=c("AssignmentStatus", "Answer.Q1thing"))
    #    Input.id Input.State 1_AssignmentStatus 2_AssignmentStatus 3_AssignmentStatus
    # 1:   134231          NY           Approved           Approved           Approved
    # 2:   134812          CA          Submitted           Approved           Approved
    #    1_Answer.Q1thing 2_Answer.Q1thing 3_Answer.Q1thing
    # 1:          Myguess          Myguess         BadGuess
    # 2:          Another          Another          Another
    

    Or everything together in one line:

    dcast(dt, Input.id + Input.State ~ dt[, seq_len(.N), by=Input.id]$V1, 
                     value.var=c("AssignmentStatus", "Answer.Q1thing"))
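For anyone who wants to run this end to end, here is a sketch with toy data reconstructed from the output printed in this answer. It assumes the data.table package is installed; the exact names of the cast columns may vary between data.table versions, so the sketch does not rely on them.

```r
library(data.table)

# Toy data reconstructed from the output shown in this answer
dt <- data.table(
  Input.id         = c(134231, 134231, 134231, 134812, 134812, 134812),
  Input.State      = c("NY", "NY", "NY", "CA", "CA", "CA"),
  AssignmentStatus = c("Approved", "Approved", "Approved",
                       "Submitted", "Approved", "Approved"),
  Answer.Q1thing   = c("Myguess", "Myguess", "BadGuess",
                       "Another", "Another", "Another")
)

# Within-group counter: 1, 2, 3 for each Input.id
dt[, id := seq_len(.N), by = Input.id]

# Cast both value columns at once; one output column per (value.var, id) pair
wide <- dcast(dt, Input.id + Input.State ~ id,
              value.var = c("AssignmentStatus", "Answer.Q1thing"))
print(wide)
```

Groups with fewer rows than the largest group would get `NA` in the missing cells, which matches the behaviour wanted in the question.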
    