R/tidyverse: calculating standard deviation across rows


Say I have the following data:

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)


                      
6 Answers

别那么骄傲
2020-12-19 06:23

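Pipelines built with the magrittr pipe %>% and dplyr verbs work column-wise, so they are not a natural fit for processing by rows.

One workaround is to transpose the numeric columns so that each original row becomes a column:

library(dplyr)

df %>% 
  select(-colA) %>%              # drop the non-numeric label column
  t() %>% as.data.frame() %>%    # transpose: each original row becomes a column (V1, V2, V3)
  summarise_all(sd)              # column-wise sd, i.e. the sd of each original row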
#        V1       V2       V3
#1 5.507571 4.163332 9.165151
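
The transpose trick works because sd() is then applied to columns that used to be rows. As a minimal sketch of the same idea in base R (assuming the df from the question, rebuilt here so the block is self-contained; `num_cols`, `row_sd` and `col_sd_of_t` are illustrative names), apply() with MARGIN = 1 gives the row-wise result directly:

```r
# Sample data as given in the thread
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

# Keep only the numeric columns
num_cols <- df[sapply(df, is.numeric)]

# Apply sd() to each row (MARGIN = 1)
row_sd <- apply(num_cols, 1, sd)

# Equivalent: transpose first, then apply sd() to each column (MARGIN = 2)
col_sd_of_t <- apply(t(num_cols), 2, sd)

print(row_sd)
```

Both vectors hold one standard deviation per original row, so either can be attached to df as a new column.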

                                                                        
                                                        
            
            
              
                
野的像风
2020-12-19 06:30

Here is another way, using pmap to get the row-wise mean and sd:

library(purrr)
library(dplyr)
library(tidyr)   # provides unnest()

# helper: mean and sd of one row's values, returned as a one-row tibble
f1 <- function(x) tibble(Mean = mean(x), SD = sd(x))
df %>% 
  # select the numeric columns
  select_if(is.numeric) %>%
  # apply f1 row-wise inside transmute to get the mean and sd
  transmute(out = pmap(., ~ f1(c(...)))) %>% 
  # unnest the list column
  unnest(cols = out) %>%
  # bind with the original dataset
  bind_cols(df, .)
#   colA colB colC colD     Mean       SD
#1 SampA   21   15   10 15.33333 5.507571
#2 SampB   20   14   22 18.66667 4.163332
#3 SampC   30   12   18 20.00000 9.165151

醉梦人生
2020-12-19 06:33

Try this, using rowSds from the matrixStats package:

library(dplyr)
library(matrixStats)

columns <- c('colB', 'colC', 'colD')

df %>% 
  mutate(Mean = rowMeans(.[columns]),
         stdev = rowSds(as.matrix(.[columns])))


Returns 

   colA colB colC colD     Mean    stdev
1 SampA   21   15   10 15.33333 5.507571
2 SampB   20   14   22 18.66667 4.163332
3 SampC   30   12   18 20.00000 9.165151


Your data

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)
df

温柔的废话
2020-12-19 06:42

A different tidyverse approach could be:

library(tidyverse)

df %>%
  rowid_to_column() %>%                      # add a row ID
  gather(var, val, -c(colA, rowid)) %>%      # wide to long, keeping colA and rowid
  group_by(rowid) %>%
  summarise(rsds = sd(val)) %>%              # sd of each original row
  left_join(df %>% rowid_to_column(), by = "rowid") %>%
  select(-rowid)

   rsds colA   colB  colC  colD
  <dbl> <fct> <dbl> <dbl> <dbl>
1  5.51 SampA    21    15    10
2  4.16 SampB    20    14    22
3  9.17 SampC    30    12    18


This first creates a row ID. Second, it reshapes the data from wide to long, leaving out colA and the row ID. Third, it groups by row ID and calculates the standard deviation. Finally, it joins the result back to the original df on the row ID.
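
The four steps can also be sketched with pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later. This is only an illustration under that assumption, not the answerer's code; the df is rebuilt from the question, and `row_sds` and `result` are names introduced here:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Sample data as given in the thread
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

# Steps 1-3: add a row ID, reshape wide to long, then take sd per row ID
row_sds <- df %>%
  rowid_to_column() %>%
  pivot_longer(cols = -c(colA, rowid), names_to = "var", values_to = "val") %>%
  group_by(rowid) %>%
  summarise(rsds = sd(val))

# Step 4: join the per-row sd back onto the original data by row ID
result <- df %>%
  rowid_to_column() %>%
  left_join(row_sds, by = "rowid") %>%
  select(-rowid)

print(result)
```

Joining onto df rather than the other way round keeps the original column order, with rsds as the last column.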

Or alternatively, using rowwise() and do():

 df %>% 
 rowwise() %>%
 do(data.frame(., rsds = sd(unlist(.[2:length(.)]))))

  colA   colB  colC  colD  rsds
* <fct> <dbl> <dbl> <dbl> <dbl>
1 SampA    21    15    10  5.51
2 SampB    20    14    22  4.16
3 SampC    30    12    18  9.17

眼角桃花
2020-12-19 06:45

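You can use pmap, or rowwise() (or group_by(colA), which works here only because every colA value is unique), along with mutate:

library(tidyverse)
# pmap_dbl returns a numeric vector; plain pmap would create a list column
df %>% mutate(sd = pmap_dbl(.[-1], ~sd(c(...)))) # same as transform(df, sd = apply(df[-1],1,sd))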
#>    colA colB colC colD       sd
#> 1 SampA   21   15   10 5.507571
#> 2 SampB   20   14   22 4.163332
#> 3 SampC   30   12   18 9.165151

df %>% rowwise() %>% mutate(sd = sd(c(colB,colC,colD)))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 5
#>   colA   colB  colC  colD    sd
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 SampA    21    15    10  5.51
#> 2 SampB    20    14    22  4.16
#> 3 SampC    30    12    18  9.17

df %>% group_by(colA) %>% mutate(sd = sd(c(colB,colC,colD)))
#> # A tibble: 3 x 5
#> # Groups:   colA [3]
#>   colA   colB  colC  colD    sd
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 SampA    21    15    10  5.51
#> 2 SampB    20    14    22  4.16
#> 3 SampC    30    12    18  9.17
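
With dplyr 1.0.0 or later, the rowwise() variant can be written without listing every column by hand, using c_across() to collect a range of columns for the current row. A minimal sketch, assuming that dplyr version and the df from the question (`row_sd` and `result` are names chosen here for illustration):

```r
library(dplyr)

# Sample data as given in the thread
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

result <- df %>%
  rowwise() %>%                                   # treat each row as its own group
  mutate(row_sd = sd(c_across(colB:colD))) %>%    # sd of colB..colD within the row
  ungroup()                                       # drop the row-wise grouping again

print(result)
```

Calling ungroup() at the end avoids later verbs silently running row by row.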

                                                                        
                                                        
            
            
              
                
名媛妹妹
2020-12-19 06:49

I see this post is a bit old, but there are some pretty complicated answers so I thought I'd suggest an easier (and faster) approach.

Calculating row means is trivial; just use rowMeans:

rowMeans(df[, c('colB', 'colC', 'colD')])


This is vectorised and very fast.

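Base R has no 'rowSd' function, but it is not hard to write one. Here is the 'rowVars' that I use.

rowVars <- function(x, na.rm = FALSE) {
    # Vectorised row-wise sample variance
    # Count the values actually used per row, so na.rm = TRUE divides by the right n - 1
    n <- if (na.rm) rowSums(!is.na(x)) else ncol(x)
    rowSums((x - rowMeans(x, na.rm = na.rm))^2, na.rm = na.rm) / (n - 1)
}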


To calculate sd:

sqrt(rowVars(df[, c('colB', 'colC', 'colD')]))


Again, this is vectorised and fast, which can be important if the input matrix is large.
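
As a quick sanity check, the vectorised formula can be compared against calling sd() once per row. The sketch below uses only base R and the sample data from the question; `num`, `vectorised_sd` and `looped_sd` are illustrative names, and the data has no missing values, so na.rm is left out:

```r
# Sample data as given in the thread
df <- data.frame(
  colA = c("SampA", "SampB", "SampC"),
  colB = c(21, 20, 30),
  colC = c(15, 14, 12),
  colD = c(10, 22, 18)
)

num <- as.matrix(df[, c("colB", "colC", "colD")])

# Row-wise sample variance: squared deviations from the row mean, divided by n - 1
row_vars <- rowSums((num - rowMeans(num))^2) / (ncol(num) - 1)
vectorised_sd <- sqrt(row_vars)

# Reference result: one sd() call per row
looped_sd <- apply(num, 1, sd)

# TRUE when both approaches agree up to floating-point tolerance
print(all.equal(unname(vectorised_sd), unname(looped_sd)))
```

Subtracting rowMeans(num) from the matrix lines up correctly because R recycles the vector down each column, so element [i, j] has the mean of row i removed.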
                                                                        
                                                        
            
            
              
                