R/tidyverse: calculating standard deviation across rows

情深已故 2020-12-19 05:53

Say I have the following data:

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)


        
6 answers
  • 2020-12-19 06:23

    The magrittr pipe %>% is not well suited to processing data by rows.
    Maybe the following is what you want.

    df %>% 
      select(-colA) %>%
      t() %>% as.data.frame() %>%
      summarise_all(sd)
    #        V1       V2       V3
    #1 5.507571 4.163332 9.165151
    
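    If the goal is to keep one row per sample rather than transposing, a base R sketch with apply() may be enough. It assumes every column except colA is numeric; the name num_cols is introduced here for illustration only.

    ```r
    # Sample data from the question
    df <- data.frame(
      colA = c("SampA", "SampB", "SampC"),
      colB = c(21, 20, 30),
      colC = c(15, 14, 12),
      colD = c(10, 22, 18)
    )

    # Assumption: every column except colA is numeric
    num_cols <- setdiff(names(df), "colA")

    # apply() with MARGIN = 1 hands each row to sd() as a numeric vector
    df$SD <- apply(df[num_cols], 1, sd)
    df
    ```

    Note that apply() converts the selected columns to a matrix first, so this only behaves as intended when all of them are numeric.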
  • 2020-12-19 06:30

    Here is another way using pmap to get the rowwise mean and sd

    library(purrr)
    library(dplyr)
    library(tidyr)
    f1 <- function(x) tibble(Mean = mean(x), SD = sd(x))
    df %>% 
      # select the numeric columns
      select_if(is.numeric) %>%
      # apply the f1 rowwise to get the mean and sd in transmute
      transmute(out = pmap(.,  ~ f1(c(...)))) %>% 
      # unnest the list column
      unnest(cols = out) %>%
      # bind with the original dataset
      bind_cols(df, .)
    #   colA colB colC colD     Mean       SD
    #1 SampA   21   15   10 15.33333 5.507571
    #2 SampB   20   14   22 18.66667 4.163332
    #3 SampC   30   12   18 20.00000 9.165151
    
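    With dplyr 1.0.0 or later, the same row-wise mean and SD can probably be written more directly with rowwise() and c_across(). This is a sketch under that version assumption, not part of the original answer; the name res is introduced here for illustration.

    ```r
    library(dplyr)  # assumes dplyr >= 1.0.0, which provides c_across()

    # Sample data from the question
    df <- data.frame(
      colA = c("SampA", "SampB", "SampC"),
      colB = c(21, 20, 30),
      colC = c(15, 14, 12),
      colD = c(10, 22, 18)
    )

    res <- df %>%
      # treat each row as its own group
      rowwise() %>%
      # c_across() collects the selected columns of the current row into a vector
      mutate(
        Mean = mean(c_across(colB:colD)),
        SD   = sd(c_across(colB:colD))
      ) %>%
      # drop the row-wise grouping so later verbs behave normally
      ungroup()

    res
    ```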
  • 2020-12-19 06:33

    Try this, using rowSds from the matrixStats package:

    library(dplyr)
    library(matrixStats)
    
    columns <- c('colB', 'colC', 'colD')
    
    df %>% 
      mutate(Mean= rowMeans(.[columns]), stdev=rowSds(as.matrix(.[columns])))
    

    Returns

       colA colB colC colD     Mean    stdev
    1 SampA   21   15   10 15.33333 5.507571
    2 SampB   20   14   22 18.66667 4.163332
    3 SampC   30   12   18 20.00000 9.165151
    

    Your data

    colA <- c("SampA", "SampB", "SampC")
    colB <- c(21, 20, 30)
    colC <- c(15, 14, 12)
    colD <- c(10, 22, 18)
    df <- data.frame(colA, colB, colC, colD)
    df
    
  • 2020-12-19 06:42

    A different tidyverse approach could be:

    df %>%
     rowid_to_column() %>%
     gather(var, val, -c(colA, rowid)) %>%
     group_by(rowid) %>%
     summarise(rsds = sd(val)) %>%
     left_join(df %>%
                rowid_to_column(), by = c("rowid" = "rowid")) %>%
     select(-rowid)
    
       rsds colA   colB  colC  colD
      <dbl> <fct> <dbl> <dbl> <dbl>
    1  5.51 SampA    21    15    10
    2  4.16 SampB    20    14    22
    3  9.17 SampC    30    12    18
    

    This first creates a row ID. Second, it reshapes the data from wide to long, excluding colA and the row ID. Third, it groups by row ID and calculates the standard deviation. Finally, it joins the result back to the original df on row ID.
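
    The same four steps can be sketched with pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later. This is a sketch under that version assumption, not part of the original answer; the names df_id and res are introduced here for illustration.

    ```r
    library(dplyr)
    library(tidyr)   # assumes tidyr >= 1.0.0, which provides pivot_longer()
    library(tibble)  # rowid_to_column()

    # Sample data from the question
    df <- data.frame(
      colA = c("SampA", "SampB", "SampC"),
      colB = c(21, 20, 30),
      colC = c(15, 14, 12),
      colD = c(10, 22, 18)
    )

    # Step 1: add a row ID
    df_id <- df %>% rowid_to_column()

    res <- df_id %>%
      # Step 2: wide to long, leaving rowid and colA untouched
      pivot_longer(cols = -c(rowid, colA), names_to = "var", values_to = "val") %>%
      # Step 3: one standard deviation per original row
      group_by(rowid) %>%
      summarise(rsds = sd(val)) %>%
      # Step 4: join back to the original columns and drop the helper ID
      left_join(df_id, by = "rowid") %>%
      select(-rowid)

    res
    ```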

    Or alternatively, using rowwise() and do():

     df %>% 
     rowwise() %>%
     do(data.frame(., rsds = sd(unlist(.[2:length(.)]))))
    
      colA   colB  colC  colD  rsds
    * <fct> <dbl> <dbl> <dbl> <dbl>
    1 SampA    21    15    10  5.51
    2 SampB    20    14    22  4.16
    3 SampC    30    12    18  9.17
    
  • 2020-12-19 06:45

    You can use pmap, or rowwise (or group_by(colA), provided the values in colA are unique), along with mutate:

    library(tidyverse)
    df %>% mutate(sd = pmap_dbl(.[-1], ~sd(c(...)))) # pmap_dbl returns a numeric vector; same as transform(df, sd = apply(df[-1],1,sd))
    #>    colA colB colC colD       sd
    #> 1 SampA   21   15   10 5.507571
    #> 2 SampB   20   14   22 4.163332
    #> 3 SampC   30   12   18 9.165151
    
    df %>% rowwise() %>% mutate(sd = sd(c(colB,colC,colD)))
    #> Source: local data frame [3 x 5]
    #> Groups: <by row>
    #> 
    #> # A tibble: 3 x 5
    #>   colA   colB  colC  colD    sd
    #>   <fct> <dbl> <dbl> <dbl> <dbl>
    #> 1 SampA    21    15    10  5.51
    #> 2 SampB    20    14    22  4.16
    #> 3 SampC    30    12    18  9.17
    
    df %>% group_by(colA) %>% mutate(sd = sd(c(colB,colC,colD)))
    #> # A tibble: 3 x 5
    #> # Groups:   colA [3]
    #>   colA   colB  colC  colD    sd
    #>   <fct> <dbl> <dbl> <dbl> <dbl>
    #> 1 SampA    21    15    10  5.51
    #> 2 SampB    20    14    22  4.16
    #> 3 SampC    30    12    18  9.17
    
  • 2020-12-19 06:49

    I see this post is a bit old, but there are some pretty complicated answers so I thought I'd suggest an easier (and faster) approach.

    Calculating row means is trivial; just use rowMeans:

    rowMeans(df[, c('colB', 'colC', 'colD')])
    

    This is vectorised and very fast.

    Base R has no 'rowSd' function, but it is not hard to write one. Here is the 'rowVars' helper that I use.

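    rowVars <- function(x, na.rm = FALSE) {
        # Vectorised row-wise sample variance
        # With na.rm = TRUE, count only the non-missing values in each row
        n <- if (na.rm) rowSums(!is.na(x)) else ncol(x)
        rowSums((x - rowMeans(x, na.rm = na.rm))^2, na.rm = na.rm) / (n - 1)
    }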
    

    To calculate sd:

    sqrt(rowVars(df[, c('colB', 'colC', 'colD')]))
    

    Again, this is vectorised and fast, which can be important if the input matrix is large.

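    As a sanity check, the sum-of-squares formula can be compared against a plain apply() call on the sample data. This is only a sketch: row_sds, fast and slow are hypothetical names introduced here, and the helper ignores NA handling.

    ```r
    # Sample data from the question
    df <- data.frame(
      colA = c("SampA", "SampB", "SampC"),
      colB = c(21, 20, 30),
      colC = c(15, 14, 12),
      colD = c(10, 22, 18)
    )
    x <- df[, c("colB", "colC", "colD")]

    # Hypothetical helper: row-wise SD from the vectorised sum-of-squares formula
    row_sds <- function(x) {
      sqrt(rowSums((x - rowMeans(x))^2) / (ncol(x) - 1))
    }

    fast <- row_sds(x)        # vectorised over all rows at once
    slow <- apply(x, 1, sd)   # calls sd() once per row

    # Both approaches should agree up to floating-point tolerance
    all.equal(unname(fast), unname(slow))
    ```

    On a three-row data frame the speed difference is irrelevant; the vectorised form is expected to matter only when there are many rows.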