问题
I was trying to get descriptive stat for my data. I went through many suggestions. However I just want to know if there is any package(s) to perform descriptive stats on the data format provided below.
head(mydata)
X A1 A2 A3 M1 M2 M3 U1 U2 U3
1 A A A M M M U U U
2 X1 100 200 250 200 230 400 400 100 200
3 X2 600 300 400 300 550 750 800 900 540
4 X3 500 300 200 200 200 100 500 400 600
The data has samples on the column and variables on rows. First row is samples name, second row is groups (A, M, U). I want to get descriptive statistics for each group. For example mean, sd.... for group A (A1, A2, A3). Could anyone please let me know how can I do this. I have seen most of the answers on descriptive stats and they are for columns. Please let me know if the question is not clear. Thank you for your help.
Higgs
回答1:
@Phil is dead right with his recommendation.
One of the key principles you'll learn in Hadley's book is the tidy data principle (very basically: variables in columns, individual observations in rows). If you want a quick introduction to tidy data, try this vignette.
There are multiple ways to go about fixing and analysing your data, but here is an example using tools from the 'tidyverse'.
# Load useful 'tidy data' packages
library(tidyverse)
# Make 'mydata'
mydata <- data_frame(X = c('', 'X1', 'X2', 'X3'),
A1 = c('A', 100, 600, 500),
A2 = c('A', 200, 300, 300),
A3 = c('A', 250, 400, 200),
M1 = c('M', 200, 300, 200),
M2 = c('M', 230, 550, 200),
M3 = c('M', 400, 750, 100),
U1 = c('U', 400, 800, 500),
U2 = c('U', 100, 900, 400),
U3 = c('U', 200, 540, 600))
# View 'mydata'
mydata
#> # A tibble: 4 x 10
#> X A1 A2 A3 M1 M2 M3 U1 U2 U3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "" A A A M M M U U U
#> 2 X1 100 200 250 200 230 400 400 100 200
#> 3 X2 600 300 400 300 550 750 800 900 540
#> 4 X3 500 300 200 200 200 100 500 400 600
Convert to a tidy dataframe
# Transpose rows and columns and convert resulting matrix back into a dataframe
mydata_new <- as_data_frame(t(mydata))
# View 'mydata_new'
mydata_new
#> # A tibble: 10 x 4
#> V1 V2 V3 V4
#> <chr> <chr> <chr> <chr>
#> 1 "" X1 X2 X3
#> 2 A 100 600 500
#> 3 A 200 300 300
#> 4 A 250 400 200
#> 5 M 200 300 200
#> 6 M 230 550 200
#> 7 M 400 750 100
#> 8 U 400 800 500
#> 9 U 100 900 400
#> 10 U 200 540 600
# Clean 'mydata_new'
## Add column names
colnames(mydata_new) <- c('Group', 'X1', 'X2', 'X3')
## Remove first row
mydata_new <- mydata_new[-1, ]
# View cleaned 'mydata_new'
mydata_new
#> # A tibble: 9 x 4
#> Group X1 X2 X3
#> <chr> <chr> <chr> <chr>
#> 1 A 100 600 500
#> 2 A 200 300 300
#> 3 A 250 400 200
#> 4 M 200 300 200
#> 5 M 230 550 200
#> 6 M 400 750 100
#> 7 U 400 800 500
#> 8 U 100 900 400
#> 9 U 200 540 600
Now summarise the data.
# Summarise numeric data
mydata_new %>%
# Convert all data columns from 'character' to 'numeric'
mutate_at(vars(starts_with('X')),
as.numeric) %>%
# Group data by the grouping variable before summarising
group_by(Group) %>%
# Calculate MEAN and SD for each data column
summarise_at(vars(starts_with('X')),
funs(MEAN = mean, SD = sd))
#> # A tibble: 3 x 7
#> Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 183. 433. 333. 76.4 153. 153.
#> 2 M 277. 533. 167. 108. 225. 57.7
#> 3 U 233. 747. 500 153. 186. 100
Update: 10 May 2018 following query about adding the coefficient of variation.
Coefficient of variation is not a base R function, so create a user-defined function.
# Define function: (cv = sd / mean)
coef_var = function(x) {
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}
Re-execute the summary with added summary functions
# Execute summary
mydata_new %>%
# Convert all data columns from 'character' to 'numeric'
mutate_at(vars(starts_with('X')),
as.numeric) %>%
# Group data by the grouping variable before summarising
group_by(Group) %>%
# Calculate summaries each data column
## Call the summary functions with a dummy "." argument so that
## Additional arguments can be added to the called functions
## (e.g., adding na.rm = TRUE to cope with missing data)
## See ?dplyr::funs for details
summarise_at(vars(starts_with('X')),
funs(MEAN = mean(., na.rm = TRUE), # Mean
SD = sd(., na.rm = TRUE), # SD
CV = coef_var, # Coefficient of variation
# Add other summary stats as needed
MEDIAN = median(., na.rm = TRUE), # Median
Q25 = quantile(., prob = 0.25, na.rm = TRUE), # 25th percentile
Q75 = quantile(., prob = 0.75, na.rm = TRUE), # 75th percentile
min = min(., na.rm = TRUE), # Minimum
max = max(., na.rm = TRUE))) # Maximum
#> # A tibble: 3 x 25
#> Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD X1_CV X2_CV X3_CV
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 183. 433. 333. 76.4 153. 153. 0.417 0.353 0.458
#> 2 M 277. 533. 167. 108. 225. 57.7 0.390 0.423 0.346
#> 3 U 233. 747. 500 153. 186. 100 0.655 0.249 0.2
#> # ... with 15 more variables: X1_MEDIAN <dbl>, X2_MEDIAN <dbl>,
#> # X3_MEDIAN <dbl>, X1_Q25 <dbl>, X2_Q25 <dbl>, X3_Q25 <dbl>,
#> # X1_Q75 <dbl>, X2_Q75 <dbl>, X3_Q75 <dbl>, X1_min <dbl>, X2_min <dbl>,
#> # X3_min <dbl>, X1_max <dbl>, X2_max <dbl>, X3_max <dbl>
Created on 2018-05-10 by the reprex package (v0.2.0).
来源:https://stackoverflow.com/questions/41529451/descriptive-statistics-in-r