Using shapiro.test on multiple columns in a data frame

伪装坚强ぢ 2020-12-28 10:08

It seems like a pretty simple question, but I can't find the answer.

I have a data frame (let's call it df) containing n=100 columns (C1, <

3 Answers
  • 2020-12-28 10:37

    To apply a function over the rows or columns of a data frame, use the apply family:

    df <- data.frame(a=rnorm(100), b=rnorm(100))    
    df.shapiro <- apply(df, 2, shapiro.test)
    df.shapiro
    $a
    
        Shapiro-Wilk normality test
    
    data:  newX[, i]
    W = 0.9895, p-value = 0.6276
    
    
    $b
    
        Shapiro-Wilk normality test
    
    data:  newX[, i]
    W = 0.9854, p-value = 0.3371
    

    Note that column names are preserved, and df.shapiro is a named list.

    Now, if you want, say, a vector of p-values, all you have to do is extract them from the elements of that list:

    unlist(lapply(df.shapiro, function(x) x$p.value))
            a         b 
    0.6275521 0.3370931 
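
    The two steps above (run the test, then pull out the p-values) can also be collapsed into one call. This is only a sketch of an alternative, not part of the original answer: it iterates over the columns directly with vapply(), and the name pvals is illustrative. The set.seed() call is added only to make the run repeatable.

    ```r
    # Sketch: one p-value per column, returned as a named numeric vector.
    set.seed(1)  # assumption: fixed seed, only for reproducibility
    df <- data.frame(a = rnorm(100), b = rnorm(100))

    # vapply() checks that every call returns exactly one numeric value
    pvals <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))
    pvals
    ```

    Because vapply() declares the expected return type, a column that yields something other than a single number raises an error instead of silently changing the shape of the result.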
    
  • 2020-12-28 10:39

    Use do.call with rbind and lapply for a simpler, more compact solution:

    df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
    do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
    #>   statistic p.value    
    #> a 0.986224  0.3875904  
    #> b 0.9894938 0.6238027
    #> c 0.9652532 0.009694794
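
    One thing to be aware of: because each shapiro.test() result is subset with `[`, the pieces being bound are lists, so the result of rbind is a matrix whose cells are list elements rather than plain numbers. If a numeric table is wanted (for sorting or filtering, say), something along the following lines may be more convenient. This is a sketch, not the answerer's code; the names res and res_df are illustrative.

    ```r
    # Sketch: build a plain numeric matrix of W statistics and p-values.
    set.seed(1)  # assumption: fixed seed, only for reproducibility
    df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

    res <- t(vapply(df, function(x) {
      st <- shapiro.test(x)
      # drop the inner name "W" from the statistic so the labels stay simple
      c(W = unname(st$statistic), p.value = st$p.value)
    }, FUN.VALUE = c(W = 0, p.value = 0)))

    # one row per column of df, numeric columns W and p.value
    res_df <- as.data.frame(res)
    res_df[order(res_df$p.value), ]
    ```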
    
  • 2020-12-28 10:42

    Not that I think this is a sensible approach to data analysis, but the underlying issue, applying a function to the columns of a data frame, is a general task that is easily handled with sapply() or lapply(). apply() also works, but for data frames one of the first two is the better choice.
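
    One reason to prefer lapply() or sapply() here is that apply() first converts a data frame to a matrix, which forces all columns to a common type. The sketch below illustrates this with a hypothetical data frame, df_mixed, that has one non-numeric column; none of these names come from the question.

    ```r
    # Sketch: why column-wise lapply() is safer than apply() on a data frame.
    set.seed(1)  # assumption: fixed seed, only for reproducibility
    df_mixed <- data.frame(x = rnorm(30), y = runif(30),
                           group = rep(c("a", "b"), 15))

    # apply() works on as.matrix(df_mixed); with a non-numeric column present,
    # the whole matrix becomes character, so shapiro.test() would fail
    m <- as.matrix(df_mixed)
    is.character(m)

    # lapply() visits each column as-is, so we can keep only the numeric ones
    is_num <- vapply(df_mixed, is.numeric, logical(1))
    res <- lapply(df_mixed[is_num], shapiro.test)
    names(res)
    ```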

    Here is an example, using some dummy data:

    set.seed(42)
    df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2), 
                     Uniform = runif(50))
    

    Now apply the shapiro.test() function. It returns a list-like object (of class "htest") for each column, so we collect the results in a list with lapply().

    lshap <- lapply(df, shapiro.test)
    lshap[[1]] ## look at the first column results
    
    R> lshap[[1]]
    
        Shapiro-Wilk normality test
    
    data:  X[[1L]]
    W = 0.9802, p-value = 0.5611
    

    You will need to extract the things you want from these objects, which all have the structure:

    R> str(lshap[[1]])
    List of 4
     $ statistic: Named num 0.98
      ..- attr(*, "names")= chr "W"
     $ p.value  : num 0.561
     $ method   : chr "Shapiro-Wilk normality test"
     $ data.name: chr "X[[1L]]"
     - attr(*, "class")= chr "htest"
    

    To collect the statistic and p.value components from every element of lshap, use sapply() this time, which arranges the results nicely:

    lres <- sapply(lshap, `[`, c("statistic","p.value"))
    
    R> lres
              Gaussian Poisson Uniform 
    statistic 0.9802   0.9371  0.918   
    p.value   0.5611   0.01034 0.001998
    

    Given that you have 100 of these, I'd transpose lres:

    R> t(lres)
             statistic p.value 
    Gaussian 0.9802    0.5611  
    Poisson  0.9371    0.01034 
    Uniform  0.918     0.001998
    

    If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
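
    As a starting point for that correction, base R provides p.adjust(). The following is a minimal sketch using the same dummy data as above; the choice of the Holm and Benjamini-Hochberg methods is only an illustration, not a recommendation from this answer, and the names pvals, p_holm and p_bh are made up for the example.

    ```r
    # Sketch: adjust the per-column Shapiro-Wilk p-values for multiple testing.
    set.seed(42)
    df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                     Uniform = runif(50))

    pvals <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

    # Holm controls the family-wise error rate; BH controls the false discovery rate
    p_holm <- p.adjust(pvals, method = "holm")
    p_bh   <- p.adjust(pvals, method = "BH")

    data.frame(raw = pvals, holm = p_holm, BH = p_bh)
    ```

    Which method is appropriate depends on what you intend to do with the results; with 100 columns, unadjusted p-values below 0.05 are expected to turn up by chance alone.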
