Is there an equivalent function for anova.lm() in Java?

Submitted by China☆狼群 on 2019-12-18 09:15:08

Question


I am comparing two linear models in R with Anova, and I would like to do the same thing in Java. To simplify it, I took the example code from https://stats.stackexchange.com/questions/48854/why-am-i-getting-different-intercept-values-in-r-and-java-for-simple-linear-regr and modified it a bit below. The models are test_trait ~ geno_A + geno_B and test_trait ~ geno_A + geno_B + geno_A:geno_B. The coefficients of the models implemented in R and Java are the same. In R I use anova(fit, fit2) where the fits are the results of lm and in Java I use TestUtils.oneWayAnovaPValue from org.apache.commons.math3.

With R I get a p-value of 0.797, while with Java I get a p-value of 0.817, so this is not the right method, but I cannot find how to do it correctly. Is there an equivalent of R's anova.lm in Java?

The full code is below.

R

test_trait <- c( -0.48812477 , 0.33458213, -0.52754476, -0.79863471, -0.68544309, -0.12970239,  0.02355622, -0.31890850,0.34725819 , 0.08108851)
geno_A <- c(1, 0, 1, 2, 0, 0, 1, 0, 1, 0)
geno_B <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0) 
fit <- lm(test_trait ~ geno_A+geno_B)
fit2 <- lm(test_trait ~ geno_A + geno_B + geno_A:geno_B)

which gives the coefficients

> fit
Call:
lm(formula = test_trait ~ geno_A + geno_B)

Coefficients:
(Intercept)       geno_A       geno_B  
   -0.03233     -0.10479     -0.60492  

> fit2
Call:
lm(formula = test_trait ~ geno_A + geno_B + geno_A:geno_B)

Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B  
    -0.008235      -0.152979      -0.677208       0.096383  

And the ANOVA

> anova(fit, fit2) # 0.797 
Analysis of Variance Table

Model 1: test_trait ~ geno_A + geno_B
Model 2: test_trait ~ geno_A + geno_B + geno_A:geno_B
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      7 0.77982                           
2      6 0.77053  1 0.0092897 0.0723  0.797

Java

double [] y = {-0.48812477,  0.33458213,
               -0.52754476, -0.79863471,
               -0.68544309, -0.12970239,
                0.02355622, -0.31890850,
                0.34725819,  0.08108851};
double [][] x = {{1,0}, {0,0},
                 {1,0}, {2,1},
                 {0,1}, {0,0},
                 {1,0}, {0,0},
                 {1,0}, {0,0}};
// third column is the interaction term geno_A * geno_B
double [][] xb = {{1,0,0}, {0,0,0},
                  {1,0,0}, {2,1,2},
                  {0,1,0}, {0,0,0},
                  {1,0,0}, {0,0,0},
                  {1,0,0}, {0,0,0}};

OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
regr.newSampleData(y, x);
double[] beta = regr.estimateRegressionParameters();   

System.out.printf("First model: y = int + genoA + genoB\n");
System.out.printf("Intercept: %.3f\t", beta[0]);
System.out.printf("beta1: %.3f\t", beta[1]);
System.out.printf("beta2: %.3f\n\n", beta[2]);

regr.newSampleData(y, xb);
double[] betab = regr.estimateRegressionParameters();   

System.out.printf("Second model: y = int + genoA + genoB + genoA:genoB\n");
System.out.printf("Intercept: %.3f\t", betab[0]);
System.out.printf("beta1: %.3f\t", betab[1]);
System.out.printf("beta2: %.3f\t", betab[2]);
System.out.printf("beta3: %.3f\n", betab[3]);

Which gives the same coefficients as in R

First model: y = int + genoA + genoB
Intercept: -0.032   beta1: -0.105   beta2: -0.605

Second model: y = int + genoA + genoB + genoA:genoB
Intercept: -0.008   beta1: -0.153   beta2: -0.677   beta3: 0.096

But the ANOVA gives a different result

List<double[]> classes = new ArrayList<>();
classes.add(beta);
classes.add(betab);
double pvalue = TestUtils.oneWayAnovaPValue(classes);
double fvalue = TestUtils.oneWayAnovaFValue(classes);
System.out.println(pvalue); 
System.out.println(fvalue); 

0.8165390406874127
0.05979444576790511

Answer 1:


You're very much misunderstanding ANOVA in the case where you compare two regressions. That's not an ANOVA in the sense of oneWayAnova. The equivalent of oneWayAnova in R is the function aov. The function anova, on the other hand, implements a multitude of tests for comparing models, and the name anova is confusing to say the least...

If you compare two regression models, you want an F test on the sums of squares. What your code does instead is a one-way ANOVA to see whether the two sets of regression parameters differ significantly. That's not what you want to do, but it's exactly what your Java code is doing.
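To make this concrete, here is a one-way ANOVA computed by hand on the two coefficient vectors from the question (class and method names are illustrative, not from the original code). It reproduces the F value 0.0598 that oneWayAnovaFValue returned, showing that the test only compares the means of the two coefficient sets, which says nothing about whether the interaction term improves the model:

```java
// One-way ANOVA computed by hand on the two coefficient vectors,
// reproducing the F value that TestUtils.oneWayAnovaFValue reports.
public class CoefAnova {
    static double oneWayF(double[] g1, double[] g2) {
        double s1 = 0, s2 = 0;
        for (double v : g1) s1 += v;
        for (double v : g2) s2 += v;
        double m1 = s1 / g1.length, m2 = s2 / g2.length;
        double grand = (s1 + s2) / (g1.length + g2.length);

        // between-group and within-group sums of squares
        double ssb = g1.length * (m1 - grand) * (m1 - grand)
                   + g2.length * (m2 - grand) * (m2 - grand);
        double ssw = 0;
        for (double v : g1) ssw += (v - m1) * (v - m1);
        for (double v : g2) ssw += (v - m2) * (v - m2);

        int dfb = 2 - 1;                      // k - 1 groups
        int dfw = g1.length + g2.length - 2;  // n - k
        return (ssb / dfb) / (ssw / dfw);
    }

    public static void main(String[] args) {
        double[] fitCoefs  = {-0.03233, -0.10479, -0.60492};
        double[] fit2Coefs = {-0.008235, -0.152979, -0.677208, 0.096383};
        // ~0.0598: the means of the two coefficient sets barely differ
        System.out.printf("F = %.4f%n", oneWayF(fitCoefs, fit2Coefs));
    }
}
```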

In order to calculate the correct F test, you need to do the following:

  1. Calculate the MSE for the largest model by dividing its Residual Sum of Squares (RSS) by its degrees of freedom (df) (in the R table: 0.77053 / 6).

  2. Calculate the MSE of the difference by subtracting the RSS of both models (the "Sum of Sq." column in the R table), subtracting the df of both models (the "Df" column in the R table), and dividing these two numbers.

  3. Divide the result of step 2 by the result of step 1 to get the F value.

  4. Calculate the p-value from the F value of step 3, using the df difference as numerator degrees of freedom and the df of the largest model as denominator degrees of freedom.
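Plugging the numbers from the R anova() table into these steps gives a quick sanity check of the F statistic (a minimal sketch; the class name is illustrative, and the RSS and df values are taken from the R output above):

```java
// Sanity check: recompute the F statistic from the R anova() table values.
public class FTestSteps {
    // F statistic for comparing two nested models, given each model's
    // residual sum of squares and residual degrees of freedom.
    static double fStatistic(double rss1, int df1, double rss2, int df2) {
        double mse = rss2 / df2;                      // step 1: MSE of the largest model
        double mseDiff = (rss1 - rss2) / (df1 - df2); // step 2: MSE of the difference
        return mseDiff / mse;                         // step 3
    }

    public static void main(String[] args) {
        // RSS and Res.Df values from the R anova() table in the question
        System.out.printf("F = %.4f%n", fStatistic(0.77982, 7, 0.77053, 6)); // 0.0723
    }
}
```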

As far as I know, the class OLSMultipleLinearRegression doesn't have any convenient method to extract the number of degrees of freedom, so this is not straightforward to do in Java. You'll have to calculate the df manually and then use the class FDistribution to calculate the p-value.

eg:

OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
regr.newSampleData(y, x);
double SSR1 = regr.calculateResidualSumOfSquares();
double df1 = y.length - (x[0].length + 1); 
    //df = n - number of coefficients, including intercept

regr.newSampleData(y, xb);
double SSR2 = regr.calculateResidualSumOfSquares();
double df2 = y.length - (xb[0].length + 1);

double MSE = SSR2/df2; // EDIT: You need the biggest model here!
double MSEdiff = Math.abs ((SSR2 - SSR1) / (df2 - df1));
double dfdiff = Math.abs(df2 - df1);

double Fval = MSEdiff / MSE;

FDistribution Fdist = new FDistribution(dfdiff, df2);
double pval = 1 - Fdist.cumulativeProbability(Fval);

Now both the F value and the p-value should be exactly what you see in the anova() table of R. df1 and df2 correspond to the Res.Df column in the R table, their difference to the Df column, and MSEdiff to Sum of Sq. divided by Df.

Disclaimer: I'm a bad Java programmer, so the code above is more conceptual than actual code. Please watch for typos or silly mistakes, and check the documentation of the FDistribution class used here:

https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/distribution/FDistribution.html#cumulativeProbability%28double%29

And now you know why statisticians use R instead of Java ;-)


EDIT : The FDistribution used in the code above is the class

org.apache.commons.math3.distribution.FDistribution

There's also an FDistribution in JSci:

JSci.maths.statistics.FDistribution

If you use that one, the last part of the code becomes:

FDistribution Fdist = new FDistribution(dfdiff, df2);
double pval = 1 - Fdist.cumulative(Fval);

Depending on the exact implementation, the cumulative probabilities might differ slightly. Alas, I have no idea about the size of the difference or which one is more trustworthy.




Answer 2:


The problem is that the methods you're comparing are not doing the same thing.

anova() in R performs a nested-model comparison to check whether your second model improves significantly by adding a variable (by default via an F test; with test="Chisq" via a likelihood-ratio-style chi-square test): more info in the answer here

oneWayAnovaPValue() in Java, on the other hand, simply runs a one-way ANOVA to check whether the difference in means between groups is significant. What you've done in this case is test whether the mean of your first set of coefficients differs significantly from that of the second set, which is not meaningful here.

As far as I know, there is no function readily available in Java to perform a likelihood-ratio test, but you can create one quite easily. In R you can do the following

anova(fit, fit2,test="Chisq")
#p: 0.788

#or manually:
df.diff = fit$df.residual - fit2$df.residual
vals <- (sum(residuals(fit)^2) - sum(residuals(fit2)^2))/sum(residuals(fit2)^2) * fit2$df.residual 
pchisq(vals, df.diff, lower.tail = FALSE)
#p: 0.7879634

so you can take the same approach in Java. A short Google search turns up an implementation of pchisq in Java here (note that lower.tail=FALSE is the same as 1 - pchisq(..., lower.tail=TRUE), so we don't really need that option).

This allows us to do the following

public void regressionRun(){
OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
OLSMultipleLinearRegression regr2 = new OLSMultipleLinearRegression();

double[] y = new double[] { -0.48812477, 0.33458213, -0.52754476,
        -0.79863471, -0.68544309, -0.12970239, 0.02355622, -0.31890850,
        0.34725819, 0.08108851 };
double[][] x = new double[10][];
double[][] x2 = new double[10][];
x[0] = new double[] { 1, 0 };
x[1] = new double[] { 0, 0 };
x[2] = new double[] { 1, 0 };
x[3] = new double[] { 2, 1 };
x[4] = new double[] { 0, 1 };
x[5] = new double[] { 0, 0 };
x[6] = new double[] { 1, 0 };
x[7] = new double[] { 0, 0 };
x[8] = new double[] { 1, 0 };
x[9] = new double[] { 0, 0 };
//
x2[0] = new double[] { 1, 0, 0 };
x2[1] = new double[] { 0, 0, 0 };
x2[2] = new double[] { 1, 0, 0 };
x2[3] = new double[] { 2, 1, 2 };
x2[4] = new double[] { 0, 1, 0 };
x2[5] = new double[] { 0, 0, 0 };
x2[6] = new double[] { 1, 0, 0 };
x2[7] = new double[] { 0, 0, 0 };
x2[8] = new double[] { 1, 0, 0 };
x2[9] = new double[] { 0, 0, 0 };

regr.newSampleData(y, x);
double[] b = regr.estimateResiduals();

regr2.newSampleData(y, x2);
double[] b2 = regr2.estimateResiduals();

//calculate residual sums of squares
double sumsq_b = 0;
double sumsq_b2 = 0;
for (double res : b){
    sumsq_b += res * res;   // res * res, not res**2: Java has no ** operator
}
for (double res : b2){
    sumsq_b2 += res * res;
}
//calculate degrees of freedom
int df_b = y.length-(x[0].length+1);
int df_b2 = y.length-(x2[0].length+1);

double vals = (sumsq_b-sumsq_b2)/sumsq_b2*df_b2;

double pvalue = 1 - pchisq(vals, df_b - df_b2); // pchisq from the implementation linked above
System.out.println(pvalue);
}
//0.7879633810167291
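Since the df difference in this example is 1, the upper-tail chi-square probability can also be computed without an external pchisq, using the identity pchisq(x, 1, lower.tail = FALSE) = erfc(sqrt(x/2)). A minimal sketch (the class name and the use of the Abramowitz & Stegun erf approximation are my additions, not part of the original answer):

```java
// Upper-tail chi-square p-value for df = 1 without external libraries,
// via pchisq(x, 1, lower.tail = FALSE) == erfc(sqrt(x/2)).
public class Pchisq1 {
    // Abramowitz & Stegun 7.1.26 approximation of erf (max error ~1.5e-7)
    static double erf(double z) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(z));
        double y = 1 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t) * Math.exp(-z * z);
        return z >= 0 ? y : -y;
    }

    static double pchisqUpperDf1(double x) {
        return 1.0 - erf(Math.sqrt(x / 2.0));   // erfc(sqrt(x/2))
    }

    public static void main(String[] args) {
        // vals as in the answer: (RSS1 - RSS2) / RSS2 * df2,
        // with the rounded RSS values from the R table
        double vals = (0.77982 - 0.77053) / 0.77053 * 6;
        System.out.println(pchisqUpperDf1(vals)); // ~0.788, close to R's 0.7879634
    }
}
```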


Source: https://stackoverflow.com/questions/35398918/is-there-an-equivalent-function-for-anova-lm-in-java
