Calculate within and between variances and confidence intervals in R

后端 未结 4 1581
误落风尘
误落风尘 2021-02-06 09:13

I need to calculate the within and between run variances from some data as part of developing a new analytical chemistry method. I also need confidence intervals from this data

相关标签:
4条回答
  • 2021-02-06 09:43

    I've been looking at a similar problem. I've found reference to caluclating confidence intervals by Burdick and Graybill (Burdick, R. and Graybill, F. 1992, Confidence Intervals on variance components, CRC Press)

    Using some code I've been trying I get these values

    
    
    > kiaraov = aov(Value~Run+Error(Run),data=kiar)
    
    > summary(kiaraov)
    
    Error: Run
        Df  Sum Sq Mean Sq
    Run  3 2.57583 0.85861
    
    Error: Within
              Df  Sum Sq Mean Sq F value Pr(>F)
    Residuals  8 1.93833 0.24229               
    > confint = 95
    
    > a = (1-(confint/100))/2
    
    > grandmean = as.vector(kiaraov$"(Intercept)"[[1]][1]) # Grand Mean (I think)
    
    > within = summary(kiaraov)$"Error: Within"[[1]]$"Mean Sq"  # S2^2Mean Square Value for Within Run
    
    > dfRun = summary(kiaraov)$"Error: Run"[[1]]$"Df"
    
    > dfWithin = summary(kiaraov)$"Error: Within"[[1]]$"Df"
    
    > Run = summary(kiaraov)$"Error: Run"[[1]]$"Mean Sq" # S1^2Mean Square for between Run
    
    > between = (Run-within)/((dfWithin/(dfRun+1))+1) # (S1^2-S2^2)/J
    
    > total = between+within
    
    > between # Between Run Variance
    [1] 0.2054398
    
    > within # Within Run Variance
    [1] 0.2422917
    
    > total # Total Variance
    [1] 0.4477315
    
    > betweenCV = sqrt(between)/grandmean * 100 # Between Run CV%
    
    > withinCV = sqrt(within)/grandmean * 100 # Within Run CV%
    
    > totalCV = sqrt(total)/grandmean * 100 # Total CV%
    
    > #within confidence intervals
    
    > withinLCB = within/qf(1-a,8,Inf) # Within LCB
    
    > withinUCB = within/qf(a,8,Inf) # Within UCB
    
    > #Between Confidence Intervals
    
    > n1 = dfRun
    
    > n2 = dfWithin
    
    > G1 = 1-(1/qf(1-a,n1,Inf)) # According to Burdick and Graybill this should be a
    
    > G2 = 1-(1/qf(1-a,n2,Inf))
    
    > H1 = (1/qf(a,n1,Inf))-1  # and this should be 1-a, but my results don't agree
    
    > H2 = (1/qf(a,n2,Inf))-1
    
    > G12 = ((qf(1-a,n1,n2)-1)^2-(G1^2*qf(1-a,n1,n2)^2)-(H2^2))/qf(1-a,n1,n2) # again, should be a, not 1-a
    
    > H12 = ((1-qf(a,n1,n2))^2-H1^2*qf(a,n1,n2)^2-G2^2)/qf(a,n1,n2) # again, should be 1-a, not a
    
    > Vu = H1^2*Run^2+G2^2*within^2+H12*Run*within
    
    > Vl = G1^2*Run^2+H2^2*within^2+G12*within*Run
    
    > betweenLCB = (Run-within-sqrt(Vl))/J # Betwen LCB
    
    > betweenUCB = (Run-within+sqrt(Vu))/J # Between UCB
    
    > #Total Confidence Intervals
    
    > y = (Run+(J-1)*within)/J
    
    > totalLCB = y-(sqrt(G1^2*Run^2+G2^2*(J-1)^2*within^2)/J) # Total LCB
    
    > totalUCB = y+(sqrt(H1^2*Run^2+H2^2*(J-1)^2*within^2)/J) # Total UCB
    
    > result = data.frame(Name=c("within", "between", "total"),CV=c(withinCV,betweenCV,totalCV),LCB=c(sqrt(withinLCB)/grandmean*100,sqrt(betweenLCB)/grandmean*100,sqrt(totalLCB)/grandmean*100),UCB=c(sqrt(withinUCB)/grandmean*100,sqrt(betweenUCB)/grandmean*100,sqrt(totalUCB)/grandmean*100))
    
    > result
         Name       CV      LCB      UCB
    1  within 4.926418 3.327584  9.43789
    2 between 4.536327      NaN 19.73568
    3   total 6.696855 4.846030 20.42647
    

    Here the lower confidence interval for between run CV is less than zero, so reported as NaN.

    I'd love to have a better way to do this. If I get time I might try to create a function to do this.

    Paul.

    --

    Edit: I did eventually write a function, here it is (caveat emptor)

    #' avar Function
    #' 
    #' Calculate thewithin, between and total %CV of a dataset by ANOVA, and the
    #' associated confidence intervals
    #' 
    #' @param dataf - The data frame to use, in long format 
    #' @param afactor Character string representing the column in dataf that contains the factor
    #' @param aresponse  Charactyer string representing the column in dataf that contains the response value
    #' @param aconfidence What Confidence limits to use, default = 95%
    #' @param digits  Significant Digits to report to, default = 3
    #' @param debug Boolean, Should debug messages be displayed, default=FALSE
    #' @returnType dataframe containing the Mean, Within, Between and Total %CV and LCB and UCB for each
    #' @return 
    #' @author Paul Hurley
    #' @export
    #' @examples 
    #' #Using the BGBottles data from Burdick and Graybill Page 62
    #' assayvar(dataf=BGBottles, afactor="Machine", aresponse="weight")
    avar<-function(dataf, afactor, aresponse, aconfidence=95, digits=3, debug=FALSE){
        dataf<-subset(dataf,!is.na(with(dataf,get(aresponse))))
        nmissing<-function(x) sum(!is.na(x))
        n<-nrow(subset(dataf,is.numeric(with(dataf,get(aresponse)))))
        datadesc<-ddply(dataf, afactor, colwise(nmissing,aresponse))
        I<-nrow(datadesc)
        if(debug){print(datadesc)}
        if(min(datadesc[,2])==max(datadesc[,2])){
            balance<-TRUE
            J<-min(datadesc[,2])
            if(debug){message(paste("Dataset is balanced, J=",J,"I is ",I,sep=""))}
        } else {
            balance<-FALSE
            Jh<-I/(sum(1/datadesc[,2], na.rm = TRUE))
            J<-Jh
            m<-min(datadesc[,2])
            M<-max(datadesc[,2])
            if(debug){message(paste("Dataset is unbalanced, like me, I is ",I,sep=""))}
            if(debug){message(paste("Jh is ",Jh, ", m is ",m, ", M is ",M, sep=""))}
        }
        if(debug){message(paste("Call afactor=",afactor,", aresponse=",aresponse,sep=""))}
        formulatext<-paste(as.character(aresponse)," ~ 1 + Error(",as.character(afactor),")",sep="")
        if(debug){message(paste("formula text is ",formulatext,sep=""))}
        aovformula<-formula(formulatext)
        if(debug){message(paste("Formula is ",as.character(aovformula),sep=""))}
        assayaov<-aov(formula=aovformula,data=dataf)
        if(debug){
            print(assayaov)
            print(summary(assayaov))
        }
        a<-1-((1-(aconfidence/100))/2)
        if(debug){message(paste("confidence is ",aconfidence,", alpha is ",a,sep=""))}
        grandmean<-as.vector(assayaov$"(Intercept)"[[1]][1]) # Grand Mean (I think)
        if(debug){message(paste("n is",n,sep=""))}
    
        #This line commented out, seems to choke with an aov object built from an external formula
        #grandmean<-as.vector(model.tables(assayaov,type="means")[[1]]$`Grand mean`) # Grand Mean (I think)
        within<-summary(assayaov)[[2]][[1]]$"Mean Sq"  # d2e, S2^2 Mean Square Value for Within Machine = 0.1819
        dfRun<-summary(assayaov)[[1]][[1]]$"Df"  # DF for within = 3
        dfWithin<-summary(assayaov)[[2]][[1]]$"Df"  # DF for within = 8
        Run<-summary(assayaov)[[1]][[1]]$"Mean Sq" # S1^2Mean Square for Machine
        if(debug){message(paste("mean square for Run ?",Run,sep=""))}
        #Was between<-(Run-within)/((dfWithin/(dfRun+1))+1) but my comment suggests this should be just J, so I'll use J !
        between<-(Run-within)/J # d2a (S1^2-S2^2)/J
        if(debug){message(paste("S1^2 mean square machine is ",Run,", S2^2 mean square within is ",within))}
        total<-between+within
        between # Between Run Variance
        within # Within Run Variance
        total # Total Variance
        if(debug){message(paste("between is ",between,", within is ",within,", Total is ",total,sep=""))}
    
        betweenCV<-sqrt(between)/grandmean * 100 # Between Run CV%
        withinCV<-sqrt(within)/grandmean * 100 # Within Run CV%
        totalCV<-sqrt(total)/grandmean * 100 # Total CV%
        n1<-dfRun
        n2<-dfWithin
        if(debug){message(paste("n1 is ",n1,", n2 is ",n2,sep=""))}
        #within confidence intervals
        if(balance){
            withinLCB<-within/qf(a,n2,Inf) # Within LCB
            withinUCB<-within/qf(1-a,n2,Inf) # Within UCB
        } else {
            withinLCB<-within/qf(a,n2,Inf) # Within LCB
            withinUCB<-within/qf(1-a,n2,Inf) # Within UCB
        }
    #Mean Confidence Intervals
        if(debug){message(paste(grandmean,"+/-(sqrt(",Run,"/",n,")*qt(",a,",df=",I-1,"))",sep=""))} 
        meanLCB<-grandmean+(sqrt(Run/n)*qt(1-a,df=I-1)) # wrong
        meanUCB<-grandmean-(sqrt(Run/n)*qt(1-a,df=I-1)) # wrong
        if(debug){message(paste("Grandmean is ",grandmean,", meanLCB = ",meanLCB,", meanUCB = ",meanUCB,aresponse,sep=""))}
        if(debug){print(summary(assayaov))}
    #Between Confidence Intervals
        G1<-1-(1/qf(a,n1,Inf)) 
        G2<-1-(1/qf(a,n2,Inf))
        H1<-(1/qf(1-a,n1,Inf))-1  
        H2<-(1/qf(1-a,n2,Inf))-1
        G12<-((qf(a,n1,n2)-1)^2-(G1^2*qf(a,n1,n2)^2)-(H2^2))/qf(a,n1,n2) 
        H12<-((1-qf(1-a,n1,n2))^2-H1^2*qf(1-a,n1,n2)^2-G2^2)/qf(1-a,n1,n2) 
        if(debug){message(paste("G1 is ",G1,", G2 is ",G2,sep=""))
            message(paste("H1 is ",H1,", H2 is ",H2,sep=""))
            message(paste("G12 is ",G12,", H12 is ",H12,sep=""))
        }
        if(balance){
            Vu<-H1^2*Run^2+G2^2*within^2+H12*Run*within
            Vl<-G1^2*Run^2+H2^2*within^2+G12*within*Run
            betweenLCB<-(Run-within-sqrt(Vl))/J # Betwen LCB
            betweenUCB<-(Run-within+sqrt(Vu))/J # Between UCB
        } else {
            #Burdick and Graybill seem to suggest calculating anova of mean values to find n1S12u/Jh
            meandataf<-ddply(.data=dataf,.variable=afactor, .fun=function(df){mean(with(df, get(aresponse)), na.rm=TRUE)})
            meandataaov<-aov(formula(paste("V1~",afactor,sep="")), data=meandataf)
            sumsquare<-summary(meandataaov)[[1]]$`Sum Sq`
            #so maybe S12u is just that bit ?
            Runu<-(sumsquare*Jh)/n1
            if(debug){message(paste("n1S12u/Jh is ",sumsquare,", so S12u is ",Runu,sep=""))}
            Vu<-H1^2*Runu^2+G2^2*within^2+H12*Runu*within
            Vl<-G1^2*Runu^2+H2^2*within^2+G12*within*Runu
            betweenLCB<-(Runu-within-sqrt(Vl))/Jh # Betwen LCB
            betweenUCB<-(Runu-within+sqrt(Vu))/Jh # Between UCB
            if(debug){message(paste("betweenLCB is ",betweenLCB,", between UCB is ",betweenUCB,sep=""))}
        }
    #Total Confidence Intervals
        if(balance){
            y<-(Run+(J-1)*within)/J
            if(debug){message(paste("y is ",y,sep=""))}
            totalLCB<-y-(sqrt(G1^2*Run^2+G2^2*(J-1)^2*within^2)/J) # Total LCB
            totalUCB<-y+(sqrt(H1^2*Run^2+H2^2*(J-1)^2*within^2)/J) # Total UCB
        } else {
            y<-(Runu+(Jh-1)*within)/Jh
            if(debug){message(paste("y is ",y,sep=""))}
            totalLCB<-y-(sqrt(G1^2*Runu^2+G2^2*(Jh-1)^2*within^2)/Jh) # Total LCB
            totalUCB<-y+(sqrt(H1^2*Runu^2+H2^2*(Jh-1)^2*within^2)/Jh) # Total UCB
        }
        if(debug){message(paste("totalLCB is ",totalLCB,", total UCB is ",totalUCB,sep=""))}
    #   result<-data.frame(Name=c("within", "between", "total"),CV=c(withinCV,betweenCV,totalCV),
    #           LCB=c(sqrt(withinLCB)/grandmean*100,sqrt(betweenLCB)/grandmean*100,sqrt(totalLCB)/grandmean*100),
    #           UCB=c(sqrt(withinUCB)/grandmean*100,sqrt(betweenUCB)/grandmean*100,sqrt(totalUCB)/grandmean*100))
        result<-data.frame(Mean=grandmean,MeanLCB=meanLCB, MeanUCB=meanUCB, Within=withinCV,WithinLCB=sqrt(withinLCB)/grandmean*100, WithinUCB=sqrt(withinUCB)/grandmean*100,
                Between=betweenCV, BetweenLCB=sqrt(betweenLCB)/grandmean*100, BetweenUCB=sqrt(betweenUCB)/grandmean*100,
                Total=totalCV, TotalLCB=sqrt(totalLCB)/grandmean*100, TotalUCB=sqrt(totalUCB)/grandmean*100)
        if(!digits=="NA"){
            result$Mean<-signif(result$Mean,digits=digits)
            result$MeanLCB<-signif(result$MeanLCB,digits=digits)
            result$MeanUCB<-signif(result$MeanUCB,digits=digits)
            result$Within<-signif(result$Within,digits=digits)
            result$WithinLCB<-signif(result$WithinLCB,digits=digits)
            result$WithinUCB<-signif(result$WithinUCB,digits=digits)
            result$Between<-signif(result$Between,digits=digits)
            result$BetweenLCB<-signif(result$BetweenLCB,digits=digits)
            result$BetweenUCB<-signif(result$BetweenUCB,digits=digits)
            result$Total<-signif(result$Total,digits=digits)
            result$TotalLCB<-signif(result$TotalLCB,digits=digits)
            result$TotalUCB<-signif(result$TotalUCB,digits=digits)
        }
        return(result)
    }
    
    assayvar<-function(adata, aresponse, afactor, anominal, aconfidence=95, digits=3, debug=FALSE){
        result<-ddply(adata,anominal,function(df){
                    resul<-avar(dataf=df,afactor=afactor,aresponse=aresponse,aconfidence=aconfidence, digits=digits, debug=debug)
                    resul$n<-nrow(subset(df, !is.na(with(df, get(aresponse)))))
                    return(resul)
                })
        return(result)
    }
    
    0 讨论(0)
  • 2021-02-06 09:46

    You have four groups of three observations:

    > run1 = c(9.85, 9.95, 10.00)
    > run2 = c(9.90, 8.80, 9.50)
    > run3 = c(11.20, 11.10, 9.80)
    > run4 = c(9.70, 10.10, 10.00)
    > runs = c(run1, run2, run3, run4)
    > runs
     [1]  9.85  9.95 10.00  9.90  8.80  9.50 11.20 11.10  9.80  9.70 10.10 10.00
    

    Make some labels:

    > n = rep(3, 4)
    > group = rep(1:4, n)
    > group
     [1] 1 1 1 2 2 2 3 3 3 4 4 4
    

    Calculate within-run stats:

    > withinRunStats = function(x) c(sum = sum(x), mean = mean(x), var = var(x), n = length(x))
    > tapply(runs, group, withinRunStats)
    $`1`
             sum         mean          var            n 
    29.800000000  9.933333333  0.005833333  3.000000000 
    
    $`2`
      sum  mean   var     n 
    28.20  9.40  0.31  3.00 
    
    $`3`
      sum  mean   var     n 
    32.10 10.70  0.61  3.00 
    
    $`4`
            sum        mean         var           n 
    29.80000000  9.93333333  0.04333333  3.00000000 
    

    You can do some ANOVA here:

    > data = data.frame(y = runs, group = factor(group))
    > data
           y group
    1   9.85     1
    2   9.95     1
    3  10.00     1
    4   9.90     2
    5   8.80     2
    6   9.50     2
    7  11.20     3
    8  11.10     3
    9   9.80     3
    10  9.70     4
    11 10.10     4
    12 10.00     4
    
    > fit = lm(runs ~ group, data)
    > fit
    
    Call:
    lm(formula = runs ~ group, data = data)
    
    Coefficients:
    (Intercept)       group2       group3       group4  
      9.933e+00   -5.333e-01    7.667e-01   -2.448e-15 
    
    > anova(fit)
    Analysis of Variance Table
    
    Response: runs
              Df  Sum Sq Mean Sq F value  Pr(>F)  
    group      3 2.57583 0.85861  3.5437 0.06769 .
    Residuals  8 1.93833 0.24229                  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    > degreesOfFreedom = anova(fit)[, "Df"]
    > names(degreesOfFreedom) = c("treatment", "error")
    > degreesOfFreedom
    treatment     error 
            3         8
    

    Error or within-group variance:

    > anova(fit)["Residuals", "Mean Sq"]
    [1] 0.2422917
    

    Treatment or between-group variance:

    > anova(fit)["group", "Mean Sq"]
    [1] 0.8586111
    

    This should give you enough confidence to do confidence intervals.

    0 讨论(0)
  • 2021-02-06 09:59

    If you want to apply a function (such as var) across a factor such as Run or Rep, you can use tapply:

    > with(variance, tapply(Value, Run, var))
              1           2           3           4 
    0.005833333 0.310000000 0.610000000 0.043333333 
    > with(variance, tapply(Value, Rep, var))
              1          2          3 
    0.48562500 0.88729167 0.05583333 
    
    0 讨论(0)
  • 2021-02-06 10:01

    I'm going to take a crack at this when I have more time, but meanwhile, here's the dput() for Kiar's data structure:

    structure(list(Run = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), Rep = c(1, 
    2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), Value = c(9.85, 9.95, 10, 9.9, 
    8.8, 9.5, 11.2, 11.1, 9.8, 9.7, 10.1, 10)), .Names = c("Run", 
    "Rep", "Value"), row.names = c(NA, -12L), class = "data.frame")
    

    ... in case you'd like to take a quick shot at it.

    0 讨论(0)
提交回复
热议问题