Transform from Wide to Long without sorting columns

I want to convert a dataframe from wide format to long format.

Here is a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=


                      
5 answers

深忆病人, 2020-12-21 13:29

The OP has updated the answer to their own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars, and asking for a direct way to do this in data.table without creating giant intermediate results.

Well, data.table already has that ability: it is called a join.

Given the sample data from the Q, the whole operation can be implemented in a less memory-consuming way by reshaping with only one id.var and later joining the reshaped result with the original data.table:

setDT(mydata)

# add unique row number to join on later 
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]

# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")

# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split column names of measure.vars
# Note that "variable" is reused to save memory 
molten[, c("variable", "measure") := tstrsplit(variable, "_")]

# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]

# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]

# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
#    ID rn measure ZA BB CC
# 1:  1  1       1  1  3 NA
# 2:  1  1       2  5  6 NA
# 3:  1  1       7 NA NA  6
# 4:  2  2       1  2  3 NA
# 5:  2  2       2  4  6 NA
# 6:  2  2       7 NA NA  5
# 7:  3  3       1  3  3 NA
# 8:  3  3       2  3  6 NA
# 9:  3  3       7 NA NA  4
#10:  4  4       1  4  3 NA
#11:  4  4       2  2  6 NA
#12:  4  4       7 NA NA  3
#13:  5  5       1  5  3 NA
#14:  5  5       2  1  6 NA
#15:  5  5       7 NA NA  2


Finally, if the row number is no longer needed, you can remove it with result[, rn := NULL].

Furthermore, you can remove the intermediate molten with rm(molten).

We started with a data.table consisting of 1 id column, 5 measure cols and 5 rows. The reshaped result has 1 id column, 3 measure cols, and 15 rows. So, the data volume stored in id columns has effectively tripled. However, the intermediate step needed only one id.var, rn.

EDIT: If memory consumption is crucial, it might be worthwhile to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.
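
As a rough sketch of that idea (not part of the original answer): the toy values below are reconstructed from the printed result, with BB stored as integer to keep all measure columns the same type, and the table names ids and long are hypothetical.

```r
library(data.table)

# toy data reconstructed from the printed result (assumption)
mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3L, 5), BB_2 = rep(6L, 5), CC_7 = 6:2)
mydata[, rn := seq_len(.N)]
measure_cols <- grep("_\\d$", names(mydata), value = TRUE)

# table 1: id columns only, one row per original row
ids <- mydata[, .(rn, ID)]

# table 2: measures in long form, carrying only the row number
long <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
long[, c("variable", "measure") := tstrsplit(variable, "_")]

# join the id columns on demand, here only for the ZA measurements
za <- long[variable == "ZA"]
za_with_id <- ids[za, on = "rn"]
za_with_id
```

With this layout the id columns are repeated only for the rows actually requested, rather than for every measure column at once.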

Note that the measure.vars parameter to melt() allows for a special function, patterns(). With this, the call to melt() could also have been written as

molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))

                                                                        
                                                        
            
            
              
                
说谎, 2020-12-21 13:34

Here is a method using base R functions split.default and do.call.

# split the non-ID variables into groups based on their name suffix
myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))

# stack the groups by row after stripping the suffix from the variable names, then cbind ID
cbind(mydata[1],
      do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
    ID ZA BB
1.1  1  1  3
1.2  2  2  3
1.3  3  3  3
1.4  4  4  3
1.5  5  5  3
2.1  1  5  6
2.2  2  4  6
2.3  3  3  6
2.4  4  2  6
2.5  5  1  6


The first line splits the data.frame variables (minus ID) into a list of groups that share the final character of their variable name; gsub extracts that character. The second line uses do.call to call rbind on this list, after setNames has stripped the underscore and final digit from the names in each group. Finally, cbind attaches the ID to the resulting data.frame.

Note that the data has to be structured regularly: if a variable is missing for one suffix, the groups no longer have matching columns and rbind fails.
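
To make the intermediate list concrete, here is a small sketch of the split step on its own. The toy data is an assumption reconstructed from the output shown above (ZA and BB columns only), and suffix is a helper name introduced for illustration.

```r
# toy data reconstructed from the printed output (assumption)
mydata <- data.frame(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5))

# the grouping key is the trailing digit of each non-ID column name
suffix <- gsub(".*_(\\d)$", "\\1", names(mydata[-1]))

# split.default treats the data.frame as a list of columns,
# so each list element is a data.frame holding one suffix group
myList <- split.default(mydata[-1], suffix)

names(myList)          # one element per suffix
lapply(myList, names)  # the columns that ended up in each group
```

Each element still carries the suffixed names, which is why the answer renames them with setNames before calling rbind.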
                                                                        
                                                        
            
            
              
                
故里飘歌, 2020-12-21 13:34

I finally found a way by modifying my initial solution:

library(data.table)

mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,
                     BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)

# id.vars are all columns without a year suffix
idvars <- grep("_20[0-9][0-9]$", names(mydata), invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
# split e.g. "ZA_2001" into var = "ZA" and measure = "2001"
temp[, `:=`(var = sub("_20[0-9][0-9]$", "", variable),
            measure = sub(".*_", "", variable), variable = NULL)]
# keep the original column order instead of alphabetical order
temp[, var := factor(var, levels = unique(var))]
dcast(temp, ... ~ var, value.var = "value")


This gives the proper measure values.
However, this solution needs a lot of memory.

The trick was to convert the var variable to a factor and specify the order I want with levels, as mtoto did.
mtoto's solution is nice because it needs only melt rather than melt and cast, but it doesn't work with my updated example; it only works when every prefix has the same number of numeric suffixes.

PS:
I've gone through every step and found that the melt step can be a big problem when working with large data.tables. If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars, the output is approx 50000000 x 500, just too much to continue with the next step.
data.table needs a direct way to do this without creating giant intermediate results.
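
A quick back-of-the-envelope check of those numbers, as a sketch in plain R. The variable names are introduced here for illustration, and the two extra columns assume melt()'s default variable and value columns.

```r
# dimensions taken from the example in the text
n_rows    <- 100000
n_cols    <- 1000
n_id      <- n_cols / 2      # half of the columns are id.vars
n_measure <- n_cols - n_id   # the remaining columns are measure.vars

# melt() emits one row per (input row, measure column) pair
molten_rows <- n_rows * n_measure

# every id column is kept, plus the "variable" and "value" columns
molten_cols <- n_id + 2

# number of cells that only hold repeated id values
id_cells <- molten_rows * n_id

c(rows = molten_rows, cols = molten_cols, id_cells = id_cells)
```

That is 50,000,000 rows by 502 columns, and almost all of the cells are copies of id values.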
                                                                        
                                                        
            
            
              
                
无人及你, 2020-12-21 13:49

An alternative approach with data.table:

melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
                        ][, variable := factor(variable, levels = unique(variable))
                          ][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]


which gives:


    ID measure ZA BB CC
 1:  1       1  1  3 NA
 2:  1       2  5  6 NA
 3:  1       7 NA NA  6
 4:  2       1  2  3 NA
 5:  2       2  4  6 NA
 6:  2       7 NA NA  5
 7:  3       1  3  3 NA
 8:  3       2  3  6 NA
 9:  3       7 NA NA  4
10:  4       1  4  3 NA
11:  4       2  2  6 NA
12:  4       7 NA NA  3
13:  5       1  5  3 NA
14:  5       2  1  6 NA
15:  5       7 NA NA  2


                                                                        
                                                        
            
            
              
                
温柔的废话, 2020-12-21 13:54

You can melt several columns simultaneously if you pass a list of column names to the argument measure =. One approach to do this in a scalable manner would be to:


Extract the measure column names and their prefixes (the part before the underscore):

measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
groups <- gsub("_[1-9]$","",measurevars)

Turn groups into a factor object and make sure levels aren't ordered alphabetically. We'll use this in the next step to create a list object with the correct structure.

split_on <- factor(groups, levels = unique(groups))

Create a list from measurevars with split(), and create a vector for the value.name = argument in melt().

measure_list <- split(measurevars, split_on)
measurenames <- unique(groups)
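
To see why the factor levels matter, this sketch compares the default (alphabetical) levels with the levels set by order of appearance. The column names are assumed from the output of this answer, and alphabetical and in_order are names introduced for illustration.

```r
# column names assumed from the output of this answer
measurevars <- c("ZA_1", "ZA_2", "BB_1", "BB_2")
groups <- gsub("_[1-9]$", "", measurevars)

# default factor levels are sorted, so BB would come before ZA
alphabetical <- split(measurevars, factor(groups))

# levels in order of first appearance keep ZA before BB
in_order <- split(measurevars, factor(groups, levels = unique(groups)))

names(alphabetical)
names(in_order)
in_order
```

The list element order is what determines how the columns line up with value.name, so the second form keeps the names and the data in step.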



Bringing it all together:

melt(setDT(mydata), 
     measure = measure_list, 
     value.name = measurenames,
     variable.name = "measure")
#    ID measure ZA BB
# 1:  1       1  1  3
# 2:  2       1  2  3
# 3:  3       1  3  3
# 4:  4       1  4  3
# 5:  5       1  5  3
# 6:  1       2  5  6
# 7:  2       2  4  6
# 8:  3       2  3  6
# 9:  4       2  2  6
#10:  5       2  1  6

                                                                        
                                                        
            
            
              
                