Tukey-Test Grouping and plotting in SciPy

我与影子孤独终老i 提交于 2019-12-10 12:25:35

问题


I'm trying to plot results from a Tukey test, but I am struggling with putting data into groups based on a P-Value. This is the equivalent in R which I am trying to replicate. I have been using the SciPy one-way ANOVA tests and the Tukey test statsmodel but can't get these groups done in the same way.

Any help is greatly appreciated

I've also just found this another example in R of what I want to do in python


回答1:


I have been struggling to do the same thing. I found a paper that tells you how to code the letters.
Hans-Peter Piepho (2004) An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons, Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515

Doing the coding was a little tricky as you need to check and replicate columns and then combine columns. I tried to add some comments to the colde. I figured out a method where you can run tukeyhsd and then from the results compute the letters. It should be possible to turn this into a function. Or hopefully part of tukeyhsd. My data is not posted but it is a column of data and then a column describing the groups. The groups for me are the five boroughs of NYC. You can also just change the comments and use random data the first time.

# Read data.  Comment out the next ones to use random data.  
df=pd.read_excel('anova_test.xlsx')
#n=1000
#df = pd.DataFrame(columns=['Groups','Data'],index=np.arange(n))
#df['Groups']=np.random.randint(1, 4,size=n)
#df['Data']=df['Groups']*np.random.random_sample(size=n)


# define columns for data and then grouping
col_to_group='Groups'
col_for_data='Data'

#Now take teh data and regroup for anova
samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]]    #I am not sure how this works but it makes an numpy array for each group 
f_val, p_val = stats.f_oneway(*samples)  # I am not sure what this star does but this passes all the numpy arrays correctly
#print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))

# this if statement can be uncommmented if you don't won't to go furhter with out p<0.05
#if p_val<0.05:    #If the p value is less than 0.05 it then does the tukey
mod = MultiComparison(df[col_for_data], df[col_to_group])
thsd=mod.tukeyhsd()
#print(mod.tukeyhsd())

#this is a function to do Piepho method.  AN Alogrithm for a letter based representation of al-pairwise comparisons.  
tot=len(thsd.groupsunique)
#make an empty dataframe that is a square matrix of size of the groups. #set first column to 1
df_ltr=pd.DataFrame(np.nan, index=np.arange(tot),columns=np.arange(tot))
df_ltr.iloc[:,0]=1
count=0
df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names'])  # I make a dummy dataframe to put axis labels into.  sd stands for signifcant difference

for i in np.arange(tot):   #I loop through and make all pairwise comparisons. 
    for j in np.arange(i+1,tot):
        #print('i=',i,'j=',j,thsd.reject[count])
        if thsd.reject[count]==True:
            for cn in np.arange(tot):
                if df_ltr.iloc[i,cn]==1 and df_ltr.iloc[j,cn]==1: #If the column contains both i and j shift and duplicat
                    df_ltr=pd.concat([df_ltr.iloc[:,:cn+1],df_ltr.iloc[:,cn+1:].T.shift().T],axis=1)
                    df_ltr.iloc[:,cn+1]=df_ltr.iloc[:,cn]
                    df_ltr.iloc[i,cn]=0
                    df_ltr.iloc[j,cn+1]=0
                #Now we need to check all columns for abosortpion.
                for cleft in np.arange(len(df_ltr.columns)-1):
                    for cright in np.arange(cleft+1,len(df_ltr.columns)):
                        if (df_ltr.iloc[:,cleft].isna()).all()==False and (df_ltr.iloc[:,cright].isna()).all()==False: 
                            if (df_ltr.iloc[:,cleft]>=df_ltr.iloc[:,cright]).all()==True:  
                                df_ltr.iloc[:,cright]=0
                                df_ltr=pd.concat([df_ltr.iloc[:,:cright],df_ltr.iloc[:,cright:].T.shift(-1).T],axis=1)
                            if (df_ltr.iloc[:,cleft]<=df_ltr.iloc[:,cright]).all()==True:
                                df_ltr.iloc[:,cleft]=0
                                df_ltr=pd.concat([df_ltr.iloc[:,:cleft],df_ltr.iloc[:,cleft:].T.shift(-1).T],axis=1)

        count+=1

#I sort so that the first column becomes A        
df_ltr=df_ltr.sort_values(by=list(df_ltr.columns),axis=1,ascending=False)

# I assign letters to each column
for cn in np.arange(len(df_ltr.columns)):
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(1,chr(97+cn)) 
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(0,'')
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(np.nan,'') 

#I put all the letters into one string
df_ltr=df_ltr.astype(str)
df_ltr.sum(axis=1)
#print(df_ltr)
#print('\n')
#print(df_ltr.sum(axis=1))

#Now to plot like R with a violing plot
fig,ax=plt.subplots()
df.boxplot(column=col_for_data, by=col_to_group,ax=ax,fontsize=16,showmeans=True
                    ,boxprops=dict(linewidth=2.0),whiskerprops=dict(linewidth=2.0))  #This makes the boxplot

ax.set_ylim([-10,20])

grps=pd.unique(df[col_to_group].values)   #Finds the group names
grps.sort() # This is critical!  Puts the groups in alphabeical order to make it match the plotting

props=dict(facecolor='white',alpha=1)
for i,grp in enumerate(grps):   #I loop through the groups to make the scatters and figure out the axis labels. 

    x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group]==grp][col_for_data]))
    ax.scatter(x,df[df[col_to_group]==grp][col_for_data],alpha=0.5,s=2)
    name="{}\navg={:0.2f}\n(n={})".format(grp
                            ,df[df[col_to_group]==grp][col_for_data].mean()
                            ,df[df[col_to_group]==grp][col_for_data].count())
    df_nms['names'][i]=name 
    ax.text(i+1,ax.get_ylim()[1]*1.1,df_ltr.sum(axis=1)[i],fontsize=10,verticalalignment='top',horizontalalignment='center',bbox=props)


ax.set_xticklabels(df_nms['names'],rotation=0,fontsize=10)
ax.set_title('')
fig.suptitle('')

fig.savefig('anovatest.jpg',dpi=600,bbox_inches='tight')

Results showing the letters above plots using the tukeyhsd



来源:https://stackoverflow.com/questions/43987651/tukey-test-grouping-and-plotting-in-scipy

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!