I am stuck performing t-tests for multiple categories in RStudio. I want the t.test result for each product type, comparing the online and offline prices. I have over 800 product types, so I don't want to do this manually for each product group.
I have a dataframe (more than 2 million rows) named data that looks like:
> Product_type Price_Online Price_Offline
1 A 48 37
2 B 29 22
3 B 32 40
4 A 38 36
5 C 32 27
6 C 31 35
7 C 28 24
8 A 47 42
9 C 40 36
Ideally I want R to write the result of the t.test to another data frame called product_types:
> Product_type
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
800 ...
becomes:
> Product_type t df p-value interval mean of difference
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
800 ...
This is the call I would use if each product type were in its own data frame:
t.test(Product_A$Price_Online, Product_A$Price_Offline, mu=0, alt="two.sided", paired = TRUE, conf.level = 0.99)
There must be an easier way to do this. Otherwise I need to make 800+ data frames and then perform the t test 800 times.
I tried things with lists & lapply but so far it doesn't work. I also tried t-Test on multiple columns: https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/
However, at the end he is still manually inserting male & female (for me over 800 categories).
One way to do it is to use by:
result <- by(data, data$Product_type,
             function(x) t.test(x$Price_Online, x$Price_Offline, mu = 0, alt = "two.sided", paired = TRUE, conf.level = 0.99))
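The only drawback is that by returns a list, and if you want your results in a data frame, you have to convert it. The number of rows is the number of values in one flattened test result, so compute it rather than hard-coding it:
df <- data.frame(t(matrix(unlist(result), nrow = length(unlist(result[[1]])))))
You'll then have to add the column names and the product type manually:
names(df) <- names(unlist(result[[1]]))
df$Product_type <- names(result)
Note that unlist coerces everything to character, so convert the numeric columns back with as.numeric if you need them.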
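If you only need the fields listed in the question (t, df, p-value, interval, mean of the differences), a base R alternative is to build one row per product type and bind the rows together. This is a minimal sketch using the nine sample rows from the question; paired_row is a hypothetical helper name and the output column names are my own choice, not anything t.test defines.

```r
# Sample rows from the question.
data <- data.frame(
  Product_type  = c("A", "B", "B", "A", "C", "C", "C", "A", "C"),
  Price_Online  = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Price_Offline = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

# Hypothetical helper: run one paired t-test and keep only the
# fields of interest as a one-row data frame.
paired_row <- function(x) {
  tt <- t.test(x$Price_Online, x$Price_Offline,
               paired = TRUE, conf.level = 0.99)
  data.frame(
    t         = unname(tt$statistic),
    df        = unname(tt$parameter),
    p_value   = tt$p.value,
    conf_low  = tt$conf.int[1],
    conf_high = tt$conf.int[2],
    mean_diff = unname(tt$estimate)
  )
}

# One data frame per product type, one result row per data frame.
rows <- lapply(split(data, data$Product_type), paired_row)

# Bind the rows and put the product type in front.
product_types <- data.frame(Product_type = names(rows),
                            do.call(rbind, rows),
                            row.names = NULL)
```

Because each field is extracted separately, the numeric columns stay numeric, which avoids the character coercion of the unlist approach.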
The tidy way of doing it is using dplyr and broom:
library(dplyr)
library(broom)
df <- data %>%
  group_by(Product_type) %>%
  do(tidy(t.test(.$Price_Online,
                 .$Price_Offline,
                 mu = 0,
                 alt = "two.sided",
                 paired = TRUE,
                 conf.level = 0.99)))
Much more readable than my base R solution, and it handles the column names for you!
EDIT
A more idiomatic way to do it rather than using do (see r4ds) is to use nest to create nested data frames for each product type, then run a t-test for each nested data frame using map from purrr.
library(broom)
library(dplyr)
library(purrr)
library(tidyr)
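t_test <- function(df, mu = 0, alt = "two.sided", paired = TRUE, conf.level = .99) {
  # Online first, as in the examples above, so the estimate is Online - Offline
  tidy(t.test(df$Price_Online,
              df$Price_Offline,
              mu = mu,
              alt = alt,
              paired = paired,
              conf.level = conf.level))
}
# Start from the original data, not the df of results built earlier
d <- data %>%
  group_by(Product_type) %>%
  nest() %>%
  ungroup() %>%
  mutate(ttest = map(data, t_test)) %>%
  select(-data) %>%  # drop the nested data; unnest's .drop argument is deprecated
  unnest(ttest)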
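With 800+ product types, some groups may contain only a single pair of prices, and t.test stops with an error in that case, which aborts the whole pipeline. One possible workaround, sketched below, is to wrap the test in purrr's possibly so that a failing group yields NULL and is dropped by unnest. The product type "D" and the name safe_t_test are made up for illustration; whether failing groups should be dropped silently is an assumption you may want to revisit.

```r
library(broom)
library(dplyr)
library(purrr)
library(tidyr)

# Sample rows from the question plus a made-up type "D" with a single pair.
data <- data.frame(
  Product_type  = c("A", "B", "B", "A", "C", "C", "C", "A", "C", "D"),
  Price_Online  = c(48, 29, 32, 38, 32, 31, 28, 47, 40, 50),
  Price_Offline = c(37, 22, 40, 36, 27, 35, 24, 42, 36, 45)
)

# Hypothetical helper: returns NULL instead of raising an error
# when t.test cannot be computed for a group.
safe_t_test <- possibly(
  function(df) tidy(t.test(df$Price_Online, df$Price_Offline,
                           paired = TRUE, conf.level = 0.99)),
  otherwise = NULL
)

d <- data %>%
  group_by(Product_type) %>%
  nest() %>%
  ungroup() %>%
  mutate(ttest = map(data, safe_t_test)) %>%
  select(-data) %>%
  unnest(ttest)  # rows whose ttest is NULL are dropped by default
```

To see which product types were skipped, compare unique(data$Product_type) with d$Product_type.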
Source: https://stackoverflow.com/questions/42609694/perform-multiple-paired-t-tests-based-on-groups-categories