Transform from Wide to Long without sorting columns

清酒与你 2020-12-21 13:24

I want to convert a dataframe from wide format to long format.

Here is a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=         


        
5 answers
  • 2020-12-21 13:29

    The OP has updated his answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He argues that data.table needs a direct way to do this without creating giant intermediate results.

    Well, data.table already has that ability: it's called a join.

    Given the sample data from the question, the whole operation can be implemented in a less memory-consuming way by reshaping with only one id.var and then joining the reshaped result with the original data.table:

    library(data.table)
    setDT(mydata)
    
    # add unique row number to join on later 
    # (leave `ID` col as placeholder for all other id.vars)
    mydata[, rn := seq_len(.N)]
    
    # define columns to be reshaped
    measure_cols <- stringr::str_subset(names(mydata), "_\\d$")
    
    # melt with only one id.vars column
    molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
    
    # split column names of measure.vars
    # Note that "variable" is reused to save memory 
    molten[, c("variable", "measure") := tstrsplit(variable, "_")]
    
    # coerce names to factors in the same order as the columns appeared in mydata
    molten[, variable := forcats::fct_inorder(variable)]
    
    # remove columns no longer needed in mydata _before_ joining to save memory
    mydata[, (measure_cols) := NULL]
    
    # final dcast and right join
    result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
    result
    #    ID rn measure ZA BB CC
    # 1:  1  1       1  1  3 NA
    # 2:  1  1       2  5  6 NA
    # 3:  1  1       7 NA NA  6
    # 4:  2  2       1  2  3 NA
    # 5:  2  2       2  4  6 NA
    # 6:  2  2       7 NA NA  5
    # 7:  3  3       1  3  3 NA
    # 8:  3  3       2  3  6 NA
    # 9:  3  3       7 NA NA  4
    #10:  4  4       1  4  3 NA
    #11:  4  4       2  2  6 NA
    #12:  4  4       7 NA NA  3
    #13:  5  5       1  5  3 NA
    #14:  5  5       2  1  6 NA
    #15:  5  5       7 NA NA  2
    

    Finally, you may remove the row number if no longer needed by result[, rn := NULL].

    Furthermore, you can remove the intermediate molten by rm(molten).

    We started with a data.table consisting of 1 id column, 5 measure columns, and 5 rows. The reshaped result has 1 id column, 3 measure columns, and 15 rows, so the data volume stored in id columns has effectively tripled. However, the intermediate step needed only one id.var, rn.

    EDIT: If memory consumption is crucial, it may be worthwhile to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.
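
    The two-table idea can be sketched roughly as follows. The toy table `wide`, its `region` column, and the object names `ids`, `long`, and `res` are assumptions made for illustration only; they do not come from the question.

```r
library(data.table)

# Hypothetical toy data: two id columns plus three measure columns
wide <- data.table(ID = 1:3, region = c("a", "b", "c"),
                   ZA_1 = 1:3, ZA_2 = 3:1, BB_1 = 7:9)
wide[, rn := seq_len(.N)]

measure_cols <- grep("_\\d$", names(wide), value = TRUE)

# Table 1: id columns only, one row per original row
ids <- wide[, setdiff(names(wide), measure_cols), with = FALSE]

# Table 2: measures in long form, carrying only the row number
long <- melt(wide[, c("rn", measure_cols), with = FALSE],
             id.vars = "rn", variable.factor = FALSE)

# Join just the id column that a given query needs
res <- ids[, .(rn, region)][long[variable == "ZA_2"], on = "rn"]
```

    Here the id columns are stored once in `ids`, and only `region` is attached to the three rows that the query actually selects.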

    Note that the measure.vars parameter of melt() accepts the special function patterns(). With it, the call to melt() could also have been written as

    molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))
    
  • 2020-12-21 13:34

    Here is a method using base R functions split.default and do.call.

    # split the non-ID variables into groups based on their name suffix
    myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))
    
    # strip the suffix from the variable names, append the groups by row, then cbind ID
    cbind(mydata[1],
          do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
        ID ZA BB
    1.1  1  1  3
    1.2  2  2  3
    1.3  3  3  3
    1.4  4  4  3
    1.5  5  5  3
    2.1  1  5  6
    2.2  2  4  6
    2.3  3  3  6
    2.4  4  2  6
    2.5  5  1  6
    

    The first line splits the data.frame variables (minus ID) into lists that agree on the final character of their variable name. This criterion is determined using gsub. The second line uses do.call to call rbind on this list of variables, modified with setNames so that the final digit and underscore are removed from their names. Finally, cbind attaches the ID to the resulting data.frame.

    Note that the data has to be structured regularly, with no missing variables, etc.
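
    One way to check that requirement up front is to cross-tabulate name prefixes against suffixes. This is only a sketch: the helper `is_regular_layout()` is hypothetical, and the toy data is assumed, modeled on the outputs shown in the other answers.

```r
# Assumed toy data; CC_7 has no counterpart in the other groups
mydata <- data.frame(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5), CC_7 = 6:2)

# Hypothetical helper: TRUE when every prefix occurs exactly once
# with every single-digit suffix
is_regular_layout <- function(nms) {
  prefix <- sub("_\\d$", "", nms)
  suffix <- sub(".*_(\\d)$", "\\1", nms)
  all(table(prefix, suffix) == 1)
}

irregular <- is_regular_layout(names(mydata)[-1])   # FALSE because of CC_7
regular   <- is_regular_layout(names(mydata)[2:5])  # TRUE for ZA and BB only
```

    With the CC_7 column present, rbind() in the code above would be handed data frames with different column names, so the check is worth running first.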

  • 2020-12-21 13:34

    I've finally found a way by modifying my initial solution:

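    library(data.table)

    mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                         BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

    # id.vars are all columns without a year suffix
    idvars <- grep("_20[0-9][0-9]$", names(mydata), invert = TRUE)
    temp <- melt(mydata, id.vars = idvars)
    # split e.g. "ZA_2001" into var = "ZA" and measure = "2001", then drop the original column
    temp[, `:=`(var = sub("_20[0-9][0-9]$", "", variable),
                measure = sub(".*_", "", variable), variable = NULL)]
    # keep the columns in their original order instead of alphabetical order
    temp[, var := factor(var, levels = unique(var))]
    dcast(temp, ... ~ var, value.var = "value")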
    

    This gives the proper measure values. However, this solution needs a lot of memory.

    The trick was converting the var variable to a factor and specifying the order I want with levels, as mtoto did. mtoto's solution is nice because it only needs melt, not melt followed by dcast, but it doesn't work with my updated example: it only works when every name prefix has the same number of numeric suffixes.
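
    The effect of that trick can be seen in isolation with base R. The vector `var` below is a made-up stand-in for the column of the same name in the code above.

```r
# Stand-in for the "var" column after melting: prefixes in column order
var <- c("ZA", "ZA", "BB", "BB", "CC")

# Default: factor() sorts the levels alphabetically
alphabetical <- levels(factor(var))
# "BB" "CC" "ZA"

# With explicit levels: order of first appearance is kept
as_seen <- levels(factor(var, levels = unique(var)))
# "ZA" "BB" "CC"
```

    Because the cast columns follow the level order, the second form is what keeps ZA, BB, CC from being sorted.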

    PS: I've been examining every step and found that the melt step can be a big problem when working with large data.tables. If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars, the output is approx. 50000000 x 500, which is simply too much to continue with the next step. data.table needs a direct way to do this without creating giant intermediate results.
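
    The size estimate can be reproduced with plain arithmetic. The variable names are illustrative, and the column count assumes melt() adds its default "variable" and "value" columns.

```r
n_rows <- 100000           # rows in the wide table
n_cols <- 1000             # columns in the wide table
n_id   <- n_cols / 2       # half of the columns are id.vars
n_measure <- n_cols - n_id

# melt() emits one row per (input row, measure column) pair
molten_rows <- n_rows * n_measure

# every id.var is repeated, plus the "variable" and "value" columns
molten_cols <- n_id + 2

molten_cells <- molten_rows * molten_cols
```

    That is 5e7 rows by 502 columns, compared with 1e8 cells in the wide table, because each of the 500 id columns is repeated once per measure column.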

  • 2020-12-21 13:49

    An alternative approach with data.table:

    melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
                            ][, variable := factor(variable, levels = unique(variable))
                              ][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]
    

    which gives:

        ID measure ZA BB CC
     1:  1       1  1  3 NA
     2:  1       2  5  6 NA
     3:  1       7 NA NA  6
     4:  2       1  2  3 NA
     5:  2       2  4  6 NA
     6:  2       7 NA NA  5
     7:  3       1  3  3 NA
     8:  3       2  3  6 NA
     9:  3       7 NA NA  4
    10:  4       1  4  3 NA
    11:  4       2  2  6 NA
    12:  4       7 NA NA  3
    13:  5       1  5  3 NA
    14:  5       2  1  6 NA
    15:  5       7 NA NA  2
    
  • 2020-12-21 13:54

    You can melt several columns simultaneously, if you pass a list of column names to the argument measure =. One approach to do this in a scalable manner would be to:

    1. Extract the measure column names and their corresponding prefixes (here, the first two letters):

      measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
      groups <- gsub("_[1-9]$","",measurevars)
      
    2. Turn groups into a factor object and make sure levels aren't ordered alphabetically. We'll use this in the next step to create a list object with the correct structure.

      split_on <- factor(groups, levels = unique(groups))
      
    3. Create a list using measurevars with split(), and create vector for the value.name = argument in melt().

      measure_list <- split(measurevars, split_on)
      measurenames <- unique(groups)
      

    Bringing it all together:

    melt(setDT(mydata), 
         measure = measure_list, 
         value.name = measurenames,
         variable.name = "measure")
    #    ID measure ZA BB
    # 1:  1       1  1  3
    # 2:  2       1  2  3
    # 3:  3       1  3  3
    # 4:  4       1  4  3
    # 5:  5       1  5  3
    # 6:  1       2  5  6
    # 7:  2       2  4  6
    # 8:  3       2  3  6
    # 9:  4       2  2  6
    #10:  5       2  1  6
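
    For a small, fixed set of prefixes, the same list of measure columns can also be described with patterns(), which another answer mentions. The sketch below assumes toy data with only the balanced ZA and BB columns, and the result name `long` is illustrative.

```r
library(data.table)

# Assumed toy data with balanced ZA/BB columns only
mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5))

# One regular expression per output column, in the desired column order
long <- melt(mydata, id.vars = "ID",
             measure.vars = patterns("^ZA_", "^BB_"),
             value.name = c("ZA", "BB"),
             variable.name = "measure")
```

    As with the list-based call, the measure column holds the position of each column within its group rather than the suffix itself, so this only lines up with the suffixes when every prefix has the same set of them.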
    