If I want to calculate the mean of two categories in Pandas, I can do it like this:
data = {'Category': ['cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2'],
'values': [1,2,3,1,2,3,1,2,3,5,1]}
my_data = DataFrame(data)
my_data.groupby('Category').mean()
Category: values:
cat1 2.666667
cat2 1.600000
I have a lot of data formatted this way, and now I need to do a T-test to see if the mean of cat1 and cat2 are statistically different. How can I do that?
it depends what sort of t-test you want to do (one sided or two sided dependent or independent) but it should be as simple as:
from scipy.stats import ttest_ind
cat1 = my_data[my_data['Category']=='cat1']
cat2 = my_data[my_data['Category']=='cat2']
ttest_ind(cat1['values'], cat2['values'])
>>> (1.4927289925706944, 0.16970867501294376)
it returns a tuple with the t-statistic & the p-value
see here for other t-tests http://docs.scipy.org/doc/scipy/reference/stats.html
EDIT: I had not realized this was about the data format. You could use
two_data = pd.DataFrame(data, index=data['Category'])
Then accessing the categories is as simple as
scipy.stats.ttest_ind(two_data.loc['cat'], two_data.loc['cat2'], equal_var=False)
The loc operator
accesses rows by label.
one sided or two sided dependent or independent
If you have two independent samples but you do not know that they have equal variance, you can use Welch's t-test. It is as simple as
scipy.stats.ttest_ind(cat1['values'], cat2['values'], equal_var=False)
For reasons to prefer Welch's test, see https://stats.stackexchange.com/questions/305/when-conducting-a-t-test-why-would-one-prefer-to-assume-or-test-for-equal-vari.
For two dependent samples, you can use
scipy.stats.ttest_rel(cat1['values'], cat2['values'])
I simplify the code a little bit.
from scipy.stats import ttest_ind
ttest_ind(*my_data.groupby('Category')['value'].apply(lambda x:list(x)))
来源:https://stackoverflow.com/questions/13404468/t-test-in-pandas